Asymptotic Behaviour of the Empirical Distance Covariance for Dependent Data

We give two asymptotic results for the empirical distance covariance on separable metric spaces without any iid assumption on the samples. In particular, we show the almost sure convergence of the empirical distance covariance for any measure with finite first moments, provided that the samples form a strictly stationary and ergodic process. We further give a result concerning the asymptotic distribution of the empirical distance covariance under the assumption of absolute regularity of the samples and extend these results to certain types of pseudometric spaces. In the process, we derive a general theorem concerning the asymptotic distribution of degenerate V-statistics of order 2 under a strong mixing condition.


Introduction
In [12], Lyons introduced the concept of distance covariance for separable metric spaces, generalising the work done by Székely et al. [17]. In this very general case, the distance covariance of a measure θ (on the product space X × Y of separable metric spaces X and Y) with marginal distributions μ on X and ν on Y is defined as

dcov(θ) := ∫ δ_θ(z, z′) dθ²(z, z′)

for z = (x, y), z′ = (x′, y′), where δ_θ(z, z′) := d_μ(x, x′) d_ν(y, y′). Here d_μ denotes the doubly centred distance d_μ(x, x′) := d(x, x′) − a_μ(x) − a_μ(x′) + D(μ), with a_μ(x) := ∫ d(x, x′) dμ(x′) and D(μ) := ∫∫ d(x, x′) dμ²(x, x′), and d_ν is defined analogously. To examine the properties of this object, Lyons made use of the concept of (strong) negative type. A metric space X is said to be of negative type if there exists a mapping φ : X → H to a Hilbert space H such that

d_X(x, x′) = ‖φ(x) − φ(x′)‖²_H for all x, x′ ∈ X.

It is of strong negative type if it is of negative type and D(μ_1 − μ_2) = 0 if and only if μ_1 = μ_2 for all probability measures μ_1, μ_2 with finite first moments. Lyons showed that the distance covariance is nonnegative if X and Y are of negative type, and that the property dcov(θ) = 0 ⇔ θ = μ ⊗ ν holds if X and Y are of strong negative type.

Marius Kroll (marius.kroll@rub.de), Department of Mathematics, Ruhr-Universität Bochum, 44780 Bochum, Germany
This means that the distance covariance completely characterises independence of random variables in metric spaces of strong negative type. Estimators for the distance covariance and their asymptotic behaviour are therefore of great interest for tests of independence.
A special case for real-valued random variables follows from choosing the embedding with

w_d(s) = Γ((d + 1)/2) π^{−(d+1)/2} |s|^{−(d+1)},

which Lyons in [12] refers to as the Fourier embedding. This results in the square of the distance covariance as introduced in [17], i.e.

dcov(θ) = ∫_{R^{p+q}} |ϕ_{(X,Y)}(s, t) − ϕ_X(s) ϕ_Y(t)|² w_p(s) w_q(t) d(s, t),

where ϕ_Z denotes the characteristic function of a random variable Z, and the vector (X, Y) ∈ R^{p+q} has distribution θ.
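For Euclidean data, the empirical distance covariance can be computed directly from doubly centred distance matrices rather than from the characteristic-function form above. The following sketch is our own illustration (all function names are ours, not from the paper):

```python
import numpy as np

def doubly_centred_distances(a):
    """Pairwise Euclidean distance matrix of the rows of a, doubly centred
    in the spirit of d_mu: d(x, x') minus row and column means plus the
    grand mean, with mu the empirical measure of the sample."""
    d = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def dcov_empirical(x, y):
    """V-statistic estimate dcov(theta_n): the mean of the entrywise
    product of the two doubly centred distance matrices."""
    A = doubly_centred_distances(x)
    B = doubly_centred_distances(y)
    return (A * B).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_indep = rng.normal(size=(200, 3))   # independent of x
y_dep = np.hstack([x, x[:, :1]])      # a deterministic function of x

print(dcov_empirical(x, y_indep))  # close to 0 under independence
print(dcov_empirical(x, y_dep))    # clearly positive under dependence
```

Since Euclidean spaces are of negative type, the estimate is nonnegative for any sample, in line with the results of [12] quoted above.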
Two of the main results of [12] are Proposition 2.6 and Theorem 2.7, which describe the asymptotic behaviour of dcov(θ_n), where θ_n is the empirical measure from n iid samples of θ. Theorem 2.7, under sufficient moment assumptions, describes the asymptotic distribution of the sequence n dcov(θ_n) if θ = μ ⊗ ν. Proposition 2.6 gives the almost sure convergence dcov(θ_n) → dcov(θ) for any measure θ with finite first moments. However, as noted by Jakobsen in [8], Lyons' proof of Proposition 2.6 was incorrect and actually required θ to have finite 5/3-moments. Lyons later acknowledged this in [13] (iii), showing that Proposition 2.6 as written in [12] is still correct in the case of spaces of negative type, but leaving open the question of whether finite first moments are sufficient in the general case of separable metric spaces. This problem was solved in [9], where the almost sure convergence is shown in the case of iid samples.
In Sect. 2, we show that one can obtain the almost sure convergence of the estimator dcov(θ_n) under a finite first moment assumption while dropping the iid assumption regarding the samples which constitute the empirical measure θ_n. In Theorem 1, we show the almost sure convergence of dcov(θ_n) under the assumptions of ergodicity and finite first moments. In Theorem 3, we give an asymptotic result similar to Theorem 2.7 in [12], assuming absolute regularity. For this we make use of Theorem 2, which is a general result concerning the asymptotic distribution of degenerate V-statistics under the assumption of α-mixing data. The definitions of α-mixing and absolute regularity are recalled at the end of this section.
A further generalisation can be achieved by raising the metrics of the underlying metric spaces to the β-th power. We will denote this with dcov β . Typically, β is chosen between 0 and 2, where the choice β = 1 results in the regular distance covariance. An equivalent way of describing this is to use the regular definitions of distance covariance, but to consider pseudometric spaces of a particular kind instead of metric spaces, namely those which result from raising some metric to the β-th power. (Here, by a pseudometric, we refer to a metric for which the triangle inequality need not hold.) In Sect. 3, we generalise the results for metric spaces deduced in Sect. 2 to pseudometric spaces of this kind.
We now summarise some of the notation used in [12], as well as some basic properties of the distance covariance that will prove useful for our purposes.
Let X and Y be random variables with values in separable metric spaces X and Y, respectively. We define Z := (X , Y ) and write θ := L(Z ), μ := L(X ) and ν := L(Y ), and denote by θ n the empirical measure of Z 1 , . . . , Z n , where (Z k ) k∈N is a strictly stationary and ergodic sequence with L(Z 1 ) = θ .
If we consider X to be of negative type via an embedding φ, we denote the Bochner integral ∫ φ dμ by β_φ(μ), and we write φ̂ for the centred embedding φ − β_φ(μ). If Y is of negative type via ψ, we define β_ψ(ν) and ψ̂ analogously. If both X and Y are of negative type via embeddings φ : X → H_1 and ψ : Y → H_2, we can consider the embedding

(φ ⊗ ψ)(x, y) := φ(x) ⊗ ψ(y),

where H_1 ⊗ H_2 is the tensor product of the Hilbert spaces H_1 and H_2, equipped with the inner product

⟨a ⊗ b, c ⊗ d⟩ := ⟨a, c⟩_{H_1} ⟨b, d⟩_{H_2}.

By Proposition 3.5 in [12], we have that

δ_θ(z, z′) = 4 ⟨φ̂(x) ⊗ ψ̂(y), φ̂(x′) ⊗ ψ̂(y′)⟩

for all z, z′ ∈ X × Y, whenever X and Y are of negative type via embeddings φ and ψ, respectively. For the remainder of this paper, we will drop the indices of the metrics on X and Y and of the inner products on H_1, H_2 or H_1 ⊗ H_2, as it is clear from their arguments which metric or inner product we consider. More precisely, d will denote both a metric on X and a (possibly different) metric on Y, and ⟨., .⟩ can denote one of three (possibly different) inner products on the Hilbert spaces H_1, H_2 and H_1 ⊗ H_2.

Recall that for two σ-algebras A and B we define the α- and β-coefficients of A and B as

α(A, B) := sup { |P(A ∩ B) − P(A)P(B)| : A ∈ A, B ∈ B }

and

β(A, B) := sup (1/2) Σ_{i=1}^{I} Σ_{j=1}^{J} |P(A_i ∩ B_j) − P(A_i)P(B_j)|,

where the latter supremum is taken over all finite partitions (A_i)_{i≤I} ⊆ A and (B_j)_{j≤J} ⊆ B of the sample space. For a process (Z_k)_{k∈N} we set α(n) := sup_{k∈N} α(σ(Z_1, …, Z_k), σ(Z_{k+n}, Z_{k+n+1}, …)), define β(n) analogously, and say that the process (Z_k)_{k∈N} is α-mixing or β-mixing if α(n) → 0 or β(n) → 0 as n → ∞, respectively. β-mixing is also known as absolute regularity. These definitions are taken from [4], where many properties of α-mixing and absolutely regular processes are established.
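For σ-algebras generated by finite-valued random variables, every event is a union of atoms, so both coefficients can be computed by brute force. The following small sketch is our own illustration of the definitions:

```python
import itertools
import numpy as np

def alpha_beta_coefficients(joint):
    """alpha- and beta-coefficients of the sigma-algebras generated by two
    finite-valued random variables with joint distribution joint[i, j]."""
    p = np.asarray(joint, dtype=float)
    px, py = p.sum(axis=1), p.sum(axis=0)
    # beta: half the total variation between the joint distribution
    # and the product of its marginals (finest partitions are optimal)
    beta = 0.5 * np.abs(p - np.outer(px, py)).sum()
    # alpha: supremum of |P(A and B) - P(A)P(B)| over all events,
    # i.e. over all unions of atoms
    alpha = 0.0
    for A in itertools.product([False, True], repeat=p.shape[0]):
        for B in itertools.product([False, True], repeat=p.shape[1]):
            a, b = np.array(A), np.array(B)
            val = abs(p[np.ix_(a, b)].sum() - px[a].sum() * py[b].sum())
            alpha = max(alpha, val)
    return alpha, beta

# perfectly dependent fair coin, X = Y
print(alpha_beta_coefficients([[0.5, 0.0], [0.0, 0.5]]))  # (0.25, 0.5)
```

The example also exhibits the well-known inequality 2 α(A, B) ≤ β(A, B), here with equality.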

Results for Metric Spaces
We now present our results in the case of separable metric spaces. It should be kept in mind that while we consider the usual distance correlation, Theorems 1 and 3 also hold for dcov β (under appropriate moment conditions). However, we postpone discussion of this until Sect. 3, so as to avoid confusion by abstraction.
The following lemma is a variant of Theorem 3.5 in [3], where it is formulated for random variables.

Lemma 1

Let (μ_n)_{n∈N} be a sequence of (possibly random) probability measures on a metrisable topological space X with μ_n ⇒ μ, and let h be a measurable function which is continuous μ-almost surely and satisfies the uniform integrability condition

lim_{M→∞} lim sup_{n→∞} ∫_{{|h| > M}} |h| dμ_n = 0. (2)

Furthermore, we require h to be dominated by some μ-integrable function g, i.e. |h| ≤ g μ-a.s. Then ∫ h dμ_n → ∫ h dμ.
Proof Without loss of generality, suppose that X is a metric space. We can decompose the integral with respect to μ_n into a truncated part and a tail part: writing T_M h := (h ∧ M) ∨ (−M) for the truncation of h at level M, we have

∫ h dμ_n = ∫ T_M h dμ_n + ∫ (h − T_M h) dμ_n.

The truncated integral converges, because it is the integral of an almost surely continuous and bounded function and μ_n ⇒ μ, while the uniform integrability condition (2) implies that the tail integral vanishes in the limit M, n → ∞. More precisely, we have the inequality

|∫ h dμ_n − ∫ h dμ| ≤ |∫ T_M h dμ_n − ∫ h dμ| + ∫_{{|h| > M}} |h| dμ_n.

The second summand vanishes by assumption due to (2). For the first summand, note that for any fixed M, the limit superior in n of the integral converges to |∫ T_M h dμ − ∫ h dμ|, which in turn vanishes as M → ∞ by dominated convergence, since |T_M h| ≤ |h| ≤ g.

In proving Theorem 1, we will make use of the following general result, which is a generalisation of Theorem U (ii) from [1].

Lemma 2
Let (Z_k)_{k∈N} be a strictly stationary and ergodic process with values in a separable metrisable topological space Z and marginal distribution L(Z_1) = θ. Let h : Z^d → R be a measurable function which is continuous θ^d-almost surely, and let f : Z → R be integrable with respect to θ, so that |h| ≤ f ⊗ ··· ⊗ f, where the product denoted by ⊗ is taken d times and (f ⊗ ··· ⊗ f)(z_1, …, z_d) := f(z_1) ··· f(z_d). Then

V_h(Z_1, …, Z_n) := n^{−d} Σ_{i_1, …, i_d = 1}^{n} h(Z_{i_1}, …, Z_{i_d}) → ∫ h dθ^d almost surely.

Proof Without loss of generality, suppose that Z is a metric space. Let θ_n := n^{−1} Σ_{k=1}^{n} δ_{Z_k} denote the empirical measure of Z_1, …, Z_n. We have the representation

V_h(Z_1, …, Z_n) = ∫ h dθ_n^d.

Furthermore, θ_n ⇒ θ a.s., since Z is separable, and therefore θ_n^d ⇒ θ^d a.s. by Theorem 2.8 (ii) in [3]. We now wish to employ Lemma 1. Hence, we need to show that the sequence of integrals fulfils the uniform integrability condition (2). By the bound |h| ≤ f ⊗ ··· ⊗ f, the tail integrals can be controlled by powers of

∫ f dθ_n = n^{−1} Σ_{k=1}^{n} f(Z_k),

which, due to Birkhoff's pointwise ergodic theorem, almost surely converges to ∫ f dθ, since f is assumed to be integrable. Lemma 1 therefore gives us ∫ h dθ_n^d → ∫ h dθ^d almost surely.

Note that the following result does not require any assumptions beyond the separability of the metric spaces X and Y and the ergodicity of the samples generating the empirical measure θ_n. Thus, Proposition 2.6 in [12] and Theorem 4.4 in [9], both of which require iid samples, are consequences of our result.
Theorem 1 Let X and Y be random variables with values in separable metric spaces X and Y, respectively, and Z := (X, Y). Write θ := L(Z), μ := L(X) and ν := L(Y), and denote by θ_n the empirical measure of Z_1, …, Z_n, where (Z_k)_{k∈N} is a strictly stationary and ergodic sequence with L(Z_1) = θ. If μ and ν have finite first moments, then dcov(θ_n) → dcov(θ) almost surely.

Proof We follow the idea of the proof of Proposition 2.6 in [12]. Consider the symmetric kernel h̄, defined as the symmetrisation of the kernel h of order 6 for which dcov(θ) = ∫ h dθ^6. As shown in the proof of Proposition 2.6 in [12], |h| can be bounded by a finite sum of products of integrable functions ϕ_i of the individual arguments; write ϕ for the maximum over all these ϕ_i. Using (4), this gives us |h̄| ≤ ϕ ⊗ ··· ⊗ ϕ up to a constant factor. The functions ϕ_i are continuous and measurable, since the underlying metric spaces are separable. They are also integrable because X and Y are assumed to have finite first moments. Using Lemma 2 therefore gives us V_h̄(Z_1, …, Z_n) → ∫ h̄ dθ^6 almost surely, where V_h̄(Z_1, …, Z_n) denotes the V-statistic with kernel h̄. Since the V-statistic with kernel h̄ is equal to dcov(θ_n), and ∫ h̄ dθ^6 = dcov(θ) (cf. [12]), this is what we wanted to show.
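Theorem 1 can be illustrated numerically. In the sketch below (our own construction, not from the paper), the components are strictly stationary, ergodic AR(1) sequences, so the pairs (X_k, Y_k) are dependent over time even though X and Y are independent of each other, and the empirical distance covariance still drifts towards dcov(θ) = 0 as n grows:

```python
import numpy as np

def dcov_empirical(x, y):
    # V-statistic form via doubly centred distance matrices (1-D samples)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    A = dx - dx.mean(axis=0) - dx.mean(axis=1)[:, None] + dx.mean()
    B = dy - dy.mean(axis=0) - dy.mean(axis=1)[:, None] + dy.mean()
    return (A * B).mean()

def ar1(n, rho, rng):
    # strictly stationary, ergodic AR(1) process with unit innovations
    z = np.empty(n)
    z[0] = rng.normal() / np.sqrt(1 - rho ** 2)  # stationary start
    for t in range(1, n):
        z[t] = rho * z[t - 1] + rng.normal()
    return z

rng = np.random.default_rng(1)
for n in (50, 200, 800):
    x, y = ar1(n, 0.7, rng), ar1(n, 0.7, rng)  # independent of each other
    print(n, dcov_empirical(x, y))  # tends towards dcov(theta) = 0
```

Replacing y by a function of x yields a markedly larger value, consistent with the characterisation of independence on spaces of strong negative type.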
Theorem 2 Let (Z_k)_{k∈N} be a strictly stationary process with marginal distribution θ which is α-mixing with α(n) = O(n^{−r}) for some r > 1 + 2ε^{−1}, and let h be a symmetric, degenerate kernel of order 2 with finite (2 + ε)-moments with respect to θ² and finite (1 + ε/2)-moments on the diagonal, for some ε > 0. Then

n V_h(Z_1, …, Z_n) converges in distribution to Σ_{k=1}^{∞} λ_k ζ_k²,

where (λ_k, ϕ_k) are pairs of the nonnegative eigenvalues and matching eigenfunctions of the integral operator

(T_h ϕ)(z) := ∫ h(z, z′) ϕ(z′) dθ(z′),

and (ζ_k)_{k∈N} is a sequence of centred Gaussian random variables whose covariance structure is given by

Cov(ζ_j, ζ_k) = Cov(ϕ_j(Z_1), ϕ_k(Z_1)) + Σ_{t=1}^{∞} [Cov(ϕ_j(Z_1), ϕ_k(Z_{1+t})) + Cov(ϕ_k(Z_1), ϕ_j(Z_{1+t}))]. (5)

Proof We note that the conditions of Theorem 2 in [16] are satisfied by Propositions 1-3 and Assumption 1 ibid., the latter of which is a consequence of E|h(Z_1, Z_2)|^{2+ε} < ∞. This yields the spectral representation

h(z, z′) = Σ_{k=1}^{∞} λ_k ϕ_k(z) ϕ_k(z′)

for all z, z′ ∈ supp(θ). The ϕ_k are centred and form an orthonormal basis of L²(θ). Adopting the notation V^{(K)} for the V-statistic with the truncated kernel h_K(z, z′) := Σ_{k=1}^{K} λ_k ϕ_k(z) ϕ_k(z′), we can write

n V^{(K)}(Z_1, …, Z_n) = Σ_{k=1}^{K} λ_k ζ_{n,k}², where ζ_{n,k} := n^{−1/2} Σ_{t=1}^{n} ϕ_k(Z_t).

Using the Cramér-Wold theorem, we will now show that, for any K ∈ N, (ζ_{n,k})_{1≤k≤K} weakly converges to (ζ_k)_{1≤k≤K}, where the ζ_k are centred Gaussian variables with their covariances given in (5).
Let c_1, …, c_K be real constants and set ξ_t := Σ_{k=1}^{K} c_k ϕ_k(Z_t). Then the ξ_t are centred random variables with finite (2 + ε)-moments; here, we have used the Cauchy-Schwarz inequality and the fact that the eigenfunctions ϕ_k form an orthonormal basis of L²(θ). Indeed, the eigenvalue relation λ_k ϕ_k(z) = ∫ h(z, z′) ϕ_k(z′) dθ(z′) gives us, by Jensen's inequality, ‖ϕ_k‖_{2+ε} ≤ λ_k^{−1} ‖h‖_{2+ε}. Since our kernel h has finite (2 + ε)-moments by assumption, this property translates to the eigenfunctions ϕ_k. Using Theorem 3.7 and Remark 1.8 in [4] therefore gives us the absolute summability of the autocovariances of (ξ_t). Thus, with S_n denoting the sum over ξ_1, …, ξ_n, we have that

σ_n² := Var(S_n) satisfies n^{−1} σ_n² → σ² := Σ_{j,k=1}^{K} c_j c_k Cov(ζ_j, ζ_k),

where we have made use of the stationarity of the process (Z_k)_{k∈N} and the fact that the eigenfunctions ϕ_k form an orthonormal basis of L². If ζ_1, …, ζ_K are Gaussian random variables with their covariance function given by (5), the limit σ² is the variance of the linear combination Σ_{k=1}^{K} c_k ζ_k.

We now show the uniform integrability of the sequence (S_n² σ_n^{−2})_{n∈N}. It suffices to show that E|S_n σ_n^{−1}|^{2+δ} is uniformly bounded in n for some δ > 0. Since h has finite (2 + ε)-moments, we get E|ξ_t|^{2+ε} ≤ M(ε) < ∞. Here, we have made use of (6) and the stationarity of the sequence (Z_n), which ensures that the upper bound M(ε) is indeed uniform in n. Since α(n) = O(n^{−r}) with r > 1 + 2ε^{−1} and σ_n has rate of growth Θ(√n), Theorem 2.1 in [15] gives us E|S_n σ_n^{−1}|^{2+δ} = O(1) for some δ > 0. This implies uniform integrability of (S_n² σ_n^{−2})_{n∈N}. Using Theorem 10.2 from [4] therefore gives us S_n/√n ⇒ N(0, σ²), and so, by the Cramér-Wold theorem, the vectors (ζ_{n,k})_{1≤k≤K} converge to Gaussian vectors (ζ_k)_{1≤k≤K} with the covariance structure described in (5) for any K ∈ N.
Now, applying the continuous mapping theorem gives us

n V^{(K)}(Z_1, …, Z_n) = Σ_{k=1}^{K} λ_k ζ_{n,k}² ⇒ Σ_{k=1}^{K} λ_k ζ_k²,

and the summability of the eigenvalues λ_k, which is due to the identity Σ_{k=1}^{∞} λ_k = ∫ h(z, z) dθ(z) < ∞, implies that the limit Σ_{k=1}^{∞} λ_k ζ_k² is well defined. We will now show that the truncation error vanishes, i.e. that

lim_{K→∞} lim sup_{n→∞} E|n V_h(Z_1, …, Z_n) − n V^{(K)}(Z_1, …, Z_n)| = 0.

We consider the Hilbert space H of all real-valued sequences (a_k)_{k∈N} for which the series Σ_k λ_k a_k² converges, equipped with the inner product given by ⟨(a_k), (b_k)⟩_H := Σ_k λ_k a_k b_k. Here, we define the covariance of two H-valued random variables X and Y as the real number Cov(X, Y) := E⟨X, Y⟩_H − ⟨EX, EY⟩_H. We aim to employ a covariance inequality for Hilbert-space valued random variables. For this, let us first consider the (2 + ε)-moments of T_K(Z_1), where T_K(z) denotes the tail sequence (ϕ_k(z) 1{k > K})_{k∈N}. For any p > 0, we get

E ‖T_K(Z_1)‖_H^{2p} = E (Σ_{k>K} λ_k ϕ_k(Z_1)²)^p ≤ E h(Z_1, Z_1)^p.

Since h has finite (1 + ε/2)-moments on the diagonal by assumption, this implies the (2 + ε)-integrability of T_K(Z_1).
Lemma 2.2 in [7] and the stationarity of the process (Z_t)_{t∈N} therefore give us a bound on the truncation error in terms of the moments of T_K(Z_1) and the sum n^{−1} Σ_{s,t=1}^{n} α(|s − t|)^{ε/(2+ε)}, and we have shown before that n^{−1} Σ_{s,t=1}^{n} α(|s − t|)^{ε/(2+ε)} converges to a finite limit c. Furthermore, since E‖T_K(Z_1)‖²_H = Σ_{k>K} λ_k → 0 as K → ∞ by the summability of the eigenvalues, the truncation error vanishes, which completes the proof.
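The eigenvalues λ_k appearing in Theorem 2 can be approximated numerically by an eigendecomposition of the kernel matrix (h(Z_i, Z_j))_{i,j} scaled by 1/n. As a sanity check (our own example, not from the text), we use the Brownian-bridge kernel h(s, t) = min(s, t) − st on [0, 1] with θ the uniform distribution, whose eigenvalues are known to be λ_k = 1/(k²π²):

```python
import numpy as np

n = 400
s = (np.arange(n) + 0.5) / n                 # midpoint grid for theta = U[0, 1]
K = np.minimum.outer(s, s) - np.outer(s, s)  # h(s, t) = min(s, t) - s * t

# Eigenvalues of the integral operator (T_h phi)(t) = int h(s, t) phi(s) ds,
# approximated by the eigenvalues of the matrix K / n (Nystroem discretisation)
lam = np.sort(np.linalg.eigvalsh(K / n))[::-1]

for k in (1, 2, 3):
    print(lam[k - 1], 1 / (k * np.pi) ** 2)  # empirical vs exact eigenvalue
```

In practice one would apply the same recipe with h replaced by the (estimated) kernel of interest and the grid replaced by the observed sample.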

Lemma 3 If (X_k)_{k∈N} is a strictly stationary sequence of random variables whose marginal distribution μ has finite q-moments, then there exists an upper bound M ∈ R such that, for any collection of indices i_1, …, i_4,

E[f(X_{i_1})^p ··· f(X_{i_4})^p] ≤ M

for any p < q, where f is the function from the proof of Theorem 1.
Proof First, consider any two indices i_1, i_2. Then, due to (17), we obtain a bound in terms of d(·, x_0), where x_0 is some arbitrary point in X. Now, let i_1, …, i_4 be fixed but arbitrary indices. Then, with a similar bound to the one used in Lemma 5, we use Lemma 1 from [18] for the function h(x_1, …, x_4) defined in (10). Thus, Lemma 1 in [18] gives us a bound involving β(n), where β(n) is the β-mixing coefficient of the sequence (Z_k)_{k∈N}. Because β(n) ≤ 1 for all n ∈ N, (10), (11) and (12) give us the assertion.

The following lemma is an adaptation of Lemma 2 in [18] in the sense that our result is implicitly contained in their proof. Another variant of this lemma (for U-statistics) can be found in [2]. Since both of these lemmas are slightly different from our version, we include a proof for the sake of completeness. However, it should be noted that all three proofs apply the same technique.

Lemma 4 Let h be a symmetric and degenerate kernel of order c ≥ 2. Here, we understand degeneracy as E h(z_1, …, z_{c−1}, Z_c) = 0 almost surely. If, for some p > 2, the p-th moments of h(Z_{i_1}, …, Z_{i_c}) are uniformly bounded and (Z_n)_{n∈N} is strictly stationary and absolutely regular with mixing coefficients β(n) satisfying Σ_{n=1}^{∞} (n + 1)^{c−1} β(n)^{1−2/p} < ∞, then

E (Σ_{i_1, …, i_c = 1}^{n} h(Z_{i_1}, …, Z_{i_c}))² = O(n^c).

Proof We will follow the basic idea of the proof of Lemma 2 in [18]. First, consider the special case of c = 2. We have

E (Σ_{i_1, i_2 = 1}^{n} h(Z_{i_1}, Z_{i_2}))² = Σ_{i_1, …, i_4 = 1}^{n} E h(Z_{i_1}, Z_{i_2}) h(Z_{i_3}, Z_{i_4}).

Now, due to the degeneracy of our kernel h, we can employ Lemma 1 in [18] to obtain a bound of the form M β(·)^{1−2/p} on each summand, evaluated at the relevant gap between the indices. Here, M is some constant uniform in i_1, …, i_4 and n.
Let us first assume that k := |i_2 − i_1| ≥ |i_4 − i_3| =: l. For any fixed value of k, we have at most 2(n − k) possible values for i_1. Furthermore, since k ≥ l ≥ 0, we have k + 1 possible values for l and, for any fixed l, at most 2(n − l) possible values for i_3. The resulting sum converges due to our assumptions on β(n). The same bound can be established for the cases where |i_4 − i_3| ≥ |i_2 − i_1|. The only combinations missing are those where (i_1, i_2) = (i_3, i_4), of which there are n². We can combine these results to obtain the desired O(n²) bound, which proves the lemma in the case c = 2.
The proof for arbitrary c follows the same idea. We then obtain an upper bound which again is O(n^c) due to our bounds on β(n).

Theorem 3 Let X and Y be random variables with values in separable metric spaces X and Y, respectively, and Z := (X, Y). Suppose that X and Y are of negative type via mappings φ and ψ, respectively, and that X × Y is σ-compact. If X and Y are independent, have finite (1 + ε)-moments for some ε > 0, and the sequence (Z_k)_{k∈N} is absolutely regular with mixing coefficients β(n) = O(n^{−r}) for some r > 6(1 + 2ε^{−1}), then

n dcov(θ_n) converges in distribution to Σ_{k=1}^{∞} λ_k ζ_k²,

where the ζ_k are centred Gaussian random variables whose covariance function given in (5) is determined by the dependence structure of the sequence (Z_k)_{k∈N}, and the parameters λ_k > 0 are determined by the underlying distribution θ.
Proof We consider the Hoeffding decomposition of h̄ into degenerate kernels h̄_c of order c, for 0 ≤ c ≤ 6. It can be readily seen that under the assumption of independence of X and Y, h̄_1 = 0 almost surely, and so the Hoeffding decomposition reduces to the terms of order c = 0 and c ≥ 2. We will show that the kernel h̄_2 satisfies the conditions of Theorem 2 and that, under our assumptions, the contributions of the kernels of order c ≥ 3 are asymptotically negligible. Application of some algebra shows that h̄_2 = δ_θ/15, proceeding in the following way: it can be easily checked that under independence of X and Y, h̄ is a degenerate kernel, since integrating over all but one argument of f (with respect to either of the marginal distributions of θ) yields a function which is 0 almost surely. Therefore, h̄_2 can be computed by averaging over the symmetric group S_6 of all permutations operating on {1, …, 6}. Notice that the summands are equal to δ_θ(z_{σ(1)}, z_{σ(2)}) if σ(1), σ(2) ∈ {1, 2}. This follows directly from the definitions of d_μ and d_ν. Moreover, 1 and 2 are the only indices appearing in both f(X_1, …, X_4) and f(Y_1, Y_2, Y_5, Y_6), so any permutation σ with σ(1), σ(2) ∉ {1, 2} results in taking the integral of f over all or all but one argument, either with respect to μ or with respect to ν. But we have seen before that these integrals are 0 almost surely, and so, due to the independence of X and Y, the same is true for the integral of h with respect to θ.
We will now prove (14). For this, we first note that under our assumptions, the kernel h̄ has finite (2 + ε)-moments with respect to θ^6. This can be seen with a similar approach as in the proof of Lemma 5. Furthermore, Lemma 3 together with the independence of X and Y gives us the existence of an upper bound M ∈ R as in Lemma 3, valid for any collection of indices 1 ≤ i_1, …, i_6 ≤ n.
Employing Lemma 4 therefore gives us the corresponding moment bounds for all c ≥ 2. Now, together with (13), we obtain the desired negligibility of the higher-order terms. This implies (14), which together with (15) proves the theorem.

Remark 1
It would be desirable to achieve a result similar to Theorem 3 under the assumption of just α-mixing. For example, Theorem 3.2 in [5] gives such a result under the supposition that X and Y are real-valued random vectors. For our more general setting of (pseudo-)metric spaces, one only needs to show that (14) still holds in the case of α-mixing, since Theorem 2 does not require absolute regularity. We consider it likely that this can indeed be derived from the favourable properties of the distance covariance.

Generalisation to Pseudometric Spaces
Let (X , d) be a metric space and consider d β for β ∈ (0, 2]. Then d β is a pseudometric, i.e. the triangle inequality does not necessarily hold for d β . We will develop parts of the theory of [12] for pseudometric spaces of this particular kind, which we will refer to as β-pseudometric spaces. This is of interest if one considers dcov β , a generalisation of the usual distance covariance, which results from using the β-th power of the metrics on X and Y for the definition of d μ and d ν . That is, dcov β with respect to (X , d) and (Y, d) is equivalent to the regular distance covariance with respect to the β-pseudometric spaces (X , d β ) and (Y, d β ). Obviously, for any constant β > 0, d β induces the same topology (and thus, the same Borel σ -algebra) as the original metric d. This means that any β-pseudometric space is a metrisable topological space.
This approach of viewing dcov β not as a different object on the same space, but as the same object on a different space might not be very intuitive at first. However, since the concept of (strong) negative type does not require a metric space, this characterisation allows us to still use the relation between (strong) negative type of the underlying space and the distance covariance. This leads to the question of whether (X , d β ) is of (strong) negative type, given the original metric space (X , d), for which some criteria are known-see for example Corollary 3 or, more generally, [11] and [14].
Note that if β ∈ (0, 1], d^β is indeed still a metric, and we can rely on the already developed theory for separable metric spaces. Thus, we get the following result: for β ∈ (0, 1], Theorems 1 and 3 hold for dcov_β as well.

Proof Theorem 1 follows immediately. For Theorem 3, we note that d^β induces the same Borel σ-algebra as d. Furthermore, by Remark 3.19 in [12], the resulting metric spaces are still of negative type.

For β ∈ (1, 2), while we cannot rely on the triangle inequality, the Jensen inequality gives us a result which we will call the weak triangle inequality. Specifically, for any β ∈ [1, 2]:

d(x, x′)^β ≤ 2^{β−1} (d(x, x_0)^β + d(x_0, x′)^β)

for all x, x′, x_0 ∈ X. This can be further bounded by replacing the factor 2^{β−1} by 2. Like in the metric case, we say that a probability measure μ has finite first moment if there exists an element x_0 ∈ X such that ∫ d(x, x_0)^β dμ(x) < ∞. Again, the choice of x_0 is arbitrary due to the weak triangle inequality. Thus, we can define the objects a_μ, D(μ) and d_μ as in the metric case.
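The weak triangle inequality is a direct consequence of convexity: (a + b)^β ≤ 2^{β−1}(a^β + b^β) for a, b ≥ 0 and β ∈ [1, 2]. A quick numerical check (our own sketch, with arbitrary random points in the Euclidean plane):

```python
import numpy as np

rng = np.random.default_rng(2)

def weak_triangle_holds(beta, trials=1000):
    """Check d(x, x')^beta <= 2**(beta - 1) * (d(x, x0)^beta + d(x0, x')^beta)
    for random points x, x', x0 in the Euclidean plane."""
    x, xp, x0 = rng.normal(size=(3, trials, 2))
    d = lambda a, b: np.sqrt(((a - b) ** 2).sum(axis=1))
    lhs = d(x, xp) ** beta
    rhs = 2 ** (beta - 1) * (d(x, x0) ** beta + d(x0, xp) ** beta)
    return bool(np.all(lhs <= rhs + 1e-12))

print([weak_triangle_holds(b) for b in (1.0, 1.5, 2.0)])  # [True, True, True]
```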

Lemma 5
If μ has finite βp-moment, then d^{(β)}_μ has finite 2p-moment with respect to μ² and finite p-moment on the diagonal for any p ≥ 1.
We can now define δ_θ and dcov(θ) analogously to the metric case. Since the relevant proofs do not make use of the triangle inequality, it follows from [12] that for pseudometric spaces of strong negative type, θ = μ ⊗ ν if and only if dcov(θ) = 0. This, together with the next lemma, gives a very easy proof of Theorem 4.2 in [6].

Lemma 6

Let (H, ‖.‖) be a separable Hilbert space and β ∈ (0, 2]. Then (H, ‖.‖^β) is of negative type, and of strong negative type for β ∈ (0, 2).

Proof There exist a Hilbert space H′ and a mapping Φ : H → H′ such that ‖x − x′‖^β = ‖Φ(x) − Φ(x′)‖²_{H′} for all x, x′ ∈ H, which implies that (H, ‖.‖^β) is of negative type. By Remark 3.19 in [12] (which, along with all its auxiliary results, also holds for pseudometric spaces), the space (H, ‖.‖^β) therefore has strong negative type for all β ∈ (0, 2).
We can use this Lemma to adapt Corollary 5.9 from [11].

Corollary 3
Let (X , d) be a metric space. If there exists an isometric embedding from X into a separable Hilbert space H , then (X , d β ) is of negative type for all β ∈ (0, 2] and of strong negative type for all β ∈ (0, 2).
Proof Fix β ∈ (0, 2], and let ϕ : X → H be an isometric embedding. By Lemma 6, (H, ‖.‖^β_H) is of negative type via some embedding Φ, which implies that (X, d^β) is of negative type via (Φ ∘ ϕ). If β < 2, then (H, ‖.‖^β_H) is of strong negative type, and so, for any two probability measures μ_1, μ_2 on X, we have that D(μ_1 − μ_2) = D(μ^ϕ_1 − μ^ϕ_2), where μ^ϕ_i denotes the pushforward of μ_i via ϕ. We can extend the last integral to the entire space H, because the pushforward measures vanish on ϕ(X)^c. Using the strong negative type of (H, ‖.‖^β_H), this gives us μ^ϕ_1 = μ^ϕ_2, which implies μ_1 = μ_2, since ϕ is injective.

Corollary 4 Let β ∈ (1, 2). Then, if we replace the finite first moment condition of Theorem 1 by a finite β-moment assumption, Theorem 1 still holds for dcov_β. If we furthermore assume X and Y to be isometrically embeddable into separable Hilbert spaces, and replace the finite (1 + ε)-moment condition with a finite (1 + ε)β-moment assumption, then Theorem 3 still holds for dcov_β.
Proof We first consider Theorem 1. We can replace (4) by |h(z_1, …, z_6)| ≤ 16 d^β(x_2, x_3) d^β(y_1, y_4), as we have done in the proof of Lemma 5. This changes the original bound only by a constant, which does not affect the remainder of the proof.
If X and Y are isometrically embeddable into separable Hilbert spaces, then by Corollary 3 the spaces resulting from raising their metrics to the power β are of negative type. By Lemma 5, the proof of Theorem 3 still holds for β-pseudometric spaces. We can therefore apply Theorem 3 to the spaces (X , d β ) and (Y, d β ).

Further Work
The limiting distribution established in Theorem 3 is dependent both on the marginal distribution θ (through the eigenvalues λ_k) and on the dependence structure of the process (Z_k)_{k∈N} (through the Gaussian process (ζ_k)_{k∈N}). Thus, one cannot directly use this result to construct a test of independence, since the critical values of this test would in general be unknown.
Such a dependence of the limiting distribution on unknown parameters is not unusual-indeed, in the iid case, there are many well-established ways to approximate the asymptotic distribution of a random variable, even if it may depend on unknown parameters. The authors of [17], for instance, propose a permutation test to approximate the asymptotic distribution of the distance covariance for real-valued iid data.
In the case of dependent data, such as we have examined in this paper, one cannot employ methods that would alter the dependence structure of the original sequence (Z k ) k∈N , since this in turn would result in a different Gaussian process (ζ k ) k∈N and thus a different limiting distribution. A feasible approach might be a type of block bootstrap (cf. [10], sections 2.5-2.7), where the resampling occurs from a collection of blocks, each consisting of a certain number of consecutive observations, thus leaving the dependence structure of the original process unchanged. We are currently working on proving the consistency of such a block bootstrap for the distance covariance.
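Although its consistency is precisely what remains to be proven, the mechanics of such a block bootstrap are easy to sketch. The following is our own illustration (block length, sample size and all names are arbitrary choices, not taken from the text):

```python
import numpy as np

def dcov_empirical(x, y):
    # V-statistic form via doubly centred distance matrices (1-D samples)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    A = dx - dx.mean(axis=0) - dx.mean(axis=1)[:, None] + dx.mean()
    B = dy - dy.mean(axis=0) - dy.mean(axis=1)[:, None] + dy.mean()
    return (A * B).mean()

def moving_block_bootstrap_dcov(x, y, block_len, n_boot, rng):
    """Resample whole blocks of consecutive pairs (x_t, y_t), so that the
    short-range dependence within blocks is preserved, and recompute the
    empirical distance covariance on each resampled series."""
    n = len(x)
    n_blocks = int(np.ceil(n / block_len))
    stats = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        idx = np.concatenate(
            [np.arange(s, s + block_len) for s in starts])[:n]
        stats[b] = dcov_empirical(x[idx], y[idx])
    return stats

rng = np.random.default_rng(3)
x, y = np.empty(300), np.empty(300)
x[0], y[0] = rng.normal(size=2)
for t in range(1, 300):           # two AR(1) series, independent of each other
    x[t] = 0.6 * x[t - 1] + rng.normal()
    y[t] = 0.6 * y[t - 1] + rng.normal()

stats = moving_block_bootstrap_dcov(x, y, block_len=25, n_boot=200, rng=rng)
print(np.quantile(stats, 0.95))   # candidate critical value for a test
```

Whether the resulting quantiles approximate the true critical values of the limiting distribution in Theorem 3 is exactly the open consistency question mentioned above.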