Asymptotic Behaviour of the Empirical Distance Covariance for Dependent Data

We give two asymptotic results for the distance covariance on separable metric spaces without any iid assumptions. In particular, we show the almost sure convergence of the empirical distance covariance under ergodicity for any measure with finite first moments. We further give a result concerning the asymptotic distribution of the empirical distance covariance under the assumption of absolute regularity and extend these results to certain types of pseudometric spaces. In the process, we derive a general theorem concerning the asymptotic distribution of degenerate V-statistics of order 2 under a strong mixing condition.


Introduction
In [10], Lyons introduced the concept of distance covariance for separable metric spaces, generalising the work done by Székely, Rizzo and Bakirov in [15]. In this very general case, the distance covariance of a measure θ (on the product space X × Y of separable metric spaces X and Y) with marginal distributions µ on X and ν on Y is defined as

dcov(θ) := ∫ δ_θ(z, z′) dθ²(z, z′) for z = (x, y), z′ = (x′, y′),

where δ_θ(z, z′) := d_µ(x, x′) d_ν(y, y′), with d_µ(x, x′) := d(x, x′) − a_µ(x) − a_µ(x′) + D(µ), a_µ(x) := ∫ d(x, x′) dµ(x′) and D(µ) := ∫ a_µ dµ; the function d_ν is defined analogously. To examine the properties of this object, Lyons made use of the concept of (strong) negative type. A metric space X is said to be of negative type if there exists a mapping φ : X → H to a Hilbert space H such that

d(x, x′) = ‖φ(x) − φ(x′)‖² for all x, x′ ∈ X.

It is of strong negative type if it is of negative type and D(µ_1 − µ_2) = 0 if and only if µ_1 = µ_2, for all probability measures µ_1, µ_2 with finite first moments. Lyons showed that the distance covariance is non-negative if X and Y are of negative type, and that the property dcov(θ) = 0 ⇔ θ = µ ⊗ ν holds if X and Y are of strong negative type.
This means that the distance covariance completely characterises independence of random variables in metric spaces of strong negative type. Estimators for the distance covariance and their asymptotic behaviour are therefore of great interest for tests of independence.
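For real-valued samples, the estimator takes the well-known double-centering form dcov(θ_n) = n^{−2} Σ_{k,l} d_{µ_n}(X_k, X_l) d_{ν_n}(Y_k, Y_l). A minimal numerical sketch of this estimator (illustrative only; the function names are our own, and we restrict to samples on the real line):

```python
import numpy as np

def double_center(D):
    # empirical analogue of d_mu: subtract row and column means, add the grand mean
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def empirical_dcov(x, y):
    # pairwise Euclidean distances of the two coordinate samples
    A = double_center(np.abs(x[:, None] - x[None, :]))
    B = double_center(np.abs(y[:, None] - y[None, :]))
    return (A * B).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=500)
dcov_dep = empirical_dcov(x, x**2)                   # strongly dependent pair
dcov_ind = empirical_dcov(x, rng.normal(size=500))   # independent pair
```

Since the real line with the Euclidean metric is of strong negative type, the statistic is non-negative and only vanishes asymptotically under independence.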
Two of the main results of [10] are Proposition 2.6 and Theorem 2.7, which describe the asymptotic behaviour of dcov(θ_n), where θ_n is the empirical measure of n iid samples from θ. Theorem 2.7 describes, under sufficient moment assumptions, the asymptotic distribution of the sequence n·dcov(θ_n) if θ = µ ⊗ ν. Proposition 2.6 gives the almost sure convergence dcov(θ_n) → dcov(θ) for any measure θ with finite first moments. However, as noted by Jakobsen in [8], Lyons' proof of Proposition 2.6 was incorrect and actually required θ to have finite 5/3-moments. Lyons later acknowledged this in [11] (iii), showing that Proposition 2.6 as written in [10] is still correct in the case of spaces of negative type, but leaving open the question of whether finite first moments are sufficient in the general case of separable metric spaces.
In Section 2, we show that one can indeed obtain the almost sure convergence of the estimator dcov(θ_n) under a finite first moment assumption. We further generalise the theory presented in [10] by dropping the iid assumption on the samples which constitute the empirical measure θ_n. In Theorem 1, we show the almost sure convergence of dcov(θ_n) under the assumptions of ergodicity and finite first moments. In Theorem 3, we give an asymptotic result similar to Theorem 2.7 in [10], assuming absolute regularity. For this we make use of Theorem 2, which is a general result concerning the asymptotic distribution of degenerate V-statistics under the assumption of α-mixing data. For definitions of α-mixing and absolute regularity, as well as basic properties of these mixing conditions, see [4].
A further generalisation can be achieved by raising the metrics of the underlying metric spaces to the β-th power. We will denote this with dcov β . Typically, β is chosen between 0 and 2, where the choice β = 1 results in the regular distance covariance. An equivalent way of describing this is to use the regular definitions of distance covariance, but to consider pseudometric spaces of a particular kind instead of metric spaces, namely those which result from raising some metric to the β-th power (here, by a pseudometric we refer to a metric for which the triangle inequality need not hold). In Section 3, we generalise the results for metric spaces deduced in Section 2 to pseudometric spaces of this kind.
We now summarise some of the notation used in [10], as well as some basic properties of the distance covariance that will prove useful for our purposes.
Let X and Y be random variables with values in separable metric spaces X and Y, respectively. Let θ = L(X, Y) denote the distribution of Z := (X, Y), and θ_n the empirical distribution of Z_1, ..., Z_n, where (Z_k)_{k∈N} is a strictly stationary and ergodic sequence with L(Z_1) = θ. The variables Z_k have a representation Z_k = Ψ ∘ T^k for a random variable Ψ with values in X × Y and an ergodic, measure-preserving transformation T on the underlying probability space, the probability measure of which we will refer to as ρ.
If we consider X to be of negative type via an embedding φ, we denote the Bochner integral ∫ φ dµ by β_φ(µ), and we write φ̄ for the centered embedding φ − β_φ(µ). If both X and Y are of negative type via embeddings φ : X → H_1 and ψ : Y → H_2, we can consider the embedding (x, y) ↦ φ̄(x) ⊗ ψ̄(y), where H_1 ⊗ H_2 is the tensor product of the Hilbert spaces H_1 and H_2, equipped with the inner product ⟨u_1 ⊗ v_1, u_2 ⊗ v_2⟩ := ⟨u_1, u_2⟩⟨v_1, v_2⟩. We do not adopt different notations for the metrics of X and Y or the inner products on H_1, H_2 and H_1 ⊗ H_2, as it is clear from their arguments which metric or inner product we consider.
By Proposition 3.5 in [10], we have that

δ_θ(z, z′) = 4⟨φ̄(x) ⊗ ψ̄(y), φ̄(x′) ⊗ ψ̄(y′)⟩ for all z, z′ ∈ X × Y,

whenever X and Y are of negative type via embeddings φ and ψ, respectively.

Results for metric spaces
We now present our results in the case of separable metric spaces. It should be kept in mind that while we consider the usual distance covariance, Theorems 1 and 3 also hold for dcov_β (under appropriate moment conditions). However, we postpone discussion of this until Section 3, so as to avoid confusion by abstraction.

Lemma 1.
Let X be a metrizable topological space, (µ_n)_{n∈N} a sequence of measures on X with weak limit µ, and h : X → R a µ-a.s. continuous function which fulfills the following uniform integrability condition:

lim_{M→∞} lim sup_{n→∞} ∫_{{|h| > M}} |h| dµ_n = 0.    (2)

Furthermore, we require h to be dominated by some µ-integrable function g, i.e. |h| ≤ g µ-a.s. Then ∫ h dµ_n → ∫ h dµ.
Proof. Without loss of generality, suppose that X is a metric space. We can decompose the integral with respect to µ_n into a truncated part and a tail part: write h_M := (h ∧ M) ∨ (−M) for the truncation at level M. The truncated integral converges, because it is the integral of an almost surely continuous and bounded function and µ_n ⇒ µ, while the uniform integrability condition (2) implies that the tail integral vanishes in the limit M, n → ∞. More precisely, we have the inequality

|∫ h dµ_n − ∫ h dµ| ≤ |∫ h_M dµ_n − ∫ h_M dµ| + ∫_{{|h| > M}} |h| dµ_n + ∫_{{|h| > M}} |h| dµ.

The second summand vanishes by assumption due to (2), and the third summand vanishes as M → ∞ by dominated convergence, since |h| ≤ g.

In proving Theorem 1, we will make use of the following general result, which is a generalisation of Theorem U (ii) from [1].

Lemma 2.
Let (X_k)_{k∈N} be a strictly stationary and ergodic process with values in a separable metrizable topological space X and marginal distribution L(X_1) = θ. Let h : X^d → R be a measurable, θ^d-a.s. continuous function, and let f : X → R be integrable with respect to θ, so that |h| ≤ f ⊗ ... ⊗ f, where the product denoted by ⊗ is taken d times. Then

V_h(X_1, ..., X_n) := n^{−d} Σ_{i_1, ..., i_d = 1}^{n} h(X_{i_1}, ..., X_{i_d}) → ∫ h dθ^d almost surely,

where V_h denotes the V-statistic with kernel h, and the convergence holds with respect to ρ.
Proof. Without loss of generality, suppose that X is a metric space. Let θ_n denote the empirical measure of X_1, ..., X_n. Since X is separable, θ_n ⇒ θ almost surely, and therefore θ_n^d ⇒ θ^d a.s. by Theorem 2.8 (ii) in [3]. We now wish to employ Lemma 1. Hence, we need to show that the sequence of integrals ∫ h d(θ_n^d) fulfills the uniform integrability condition (2); this follows from the ergodic theorem applied to f, since f is assumed to be integrable. Lemma 1 therefore gives us

V_h(X_1, ..., X_n) = ∫ h d(θ_n^d) → ∫ h dθ^d,

where the almost sure convergence holds with respect to the measure ρ on the underlying probability space.
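Lemma 2 can be observed numerically: a V-statistic built from an ergodic but clearly non-iid sequence still converges to the population integral. A small sketch (a toy example of our own) with an AR(1) process and the kernel h(x, x′) = |x − x′|, for which ∫ h dθ² = E|X − X′| = 2/√π when the stationary marginal is standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, a = 2000, 0.5
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    # AR(1) with standard normal stationary marginal: ergodic, but dependent
    x[t] = a * x[t - 1] + np.sqrt(1 - a**2) * rng.normal()

# order-2 V-statistic with kernel h(x, x') = |x - x'|
v_stat = np.abs(x[:, None] - x[None, :]).mean()

# population value E|X - X'| for independent standard normals
target = 2 / np.sqrt(np.pi)
```

Despite the serial dependence, v_stat is close to the population value, as Lemma 2 predicts.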
Note that the following result does not require any assumptions beyond the separability of the metric spaces X and Y and the ergodicity of the samples generating the empirical measure θ_n. Thus, Proposition 2.6 in [10], which requires iid samples, is a special case of our result. It follows that the original assumption of finite first moments is indeed sufficient.

Theorem 1.
Let µ and ν have finite first moments. Then dcov(θ_n) → dcov(θ), where the almost sure convergence holds with respect to ρ.
Proof. We follow the idea of the proof of Proposition 2.6 in [10]. Consider the symmetric kernel h̄, defined as the symmetrisation of the kernel h used there. Let z_0 = (x_0, y_0) be an arbitrary but fixed point in X × Y. Since a + b ≤ ab for all real a, b ≥ 2, we can define ϕ_i(z) as 2 ∨ d(x, x_0) for the indices i belonging to the x-part of the kernel and as 2 ∨ d(y, y_0) if i = 1, 6, and write ϕ for the maximum over all these ϕ_i. Using (4), this gives us a bound on |h̄| by a constant multiple of ϕ ⊗ ... ⊗ ϕ. The functions ϕ_i are continuous and measurable, since the underlying metric spaces are separable. They are also integrable, because µ and ν are assumed to have finite first moments. Using Lemma 2 therefore gives us V_h̄(Z_1, ..., Z_n) → ∫ h̄ dθ^6 almost surely with respect to ρ, where V_h̄ denotes the V-statistic with kernel h̄. Since the V-statistic with kernel h̄ is equal to dcov(θ_n), and ∫ h̄ dθ^6 = dcov(θ) (cf. [10]), this is what we wanted to show.
Theorem 2.
Let (Z_k)_{k∈N} be a strictly stationary sequence with marginal distribution θ which is α-mixing with mixing coefficients α(n) = O(n^{−r}) for some r > 1 + 2ε^{−1}, ε > 0. Consider a continuous, symmetric, degenerate and positive semidefinite kernel h : Z² → R with finite (2 + ε)-moments with respect to θ² and finite (1 + ε/2)-moments on the diagonal. Then

n V_h(Z_1, ..., Z_n) →_d Σ_{k=1}^{∞} λ_k ζ_k²,

where (λ_k, ϕ_k)_{k∈N} are the pairs of the non-negative eigenvalues and matching eigenfunctions of the integral operator

T_h f(z) := ∫ h(z, z′) f(z′) dθ(z′),

and (ζ_k)_{k∈N} is a sequence of centered Gaussian random variables whose covariance structure is given by

Cov(ζ_k, ζ_l) = Cov(ϕ_k(Z_1), ϕ_l(Z_1)) + Σ_{d=1}^{∞} [Cov(ϕ_k(Z_1), ϕ_l(Z_{1+d})) + Cov(ϕ_l(Z_1), ϕ_k(Z_{1+d}))].    (5)

Proof. We note that the conditions of Theorem 2 in [14] are satisfied by Propositions 1-3 and Assumption 1 ibid., the latter of which is a consequence of E|h(Z_1, Z_1)| < ∞. Since h is continuous and positive semidefinite, it admits the expansion h(z, z′) = Σ_{k=1}^{∞} λ_k ϕ_k(z) ϕ_k(z′) for all z, z′ ∈ supp(θ). The ϕ_k are centered and form an orthonormal basis of L²(θ). Writing V^(K) for the V-statistic for the truncated kernel h^(K)(z, z′) := Σ_{k=1}^{K} λ_k ϕ_k(z) ϕ_k(z′), we have n V^(K)(Z_1, ..., Z_n) = Σ_{k=1}^{K} λ_k ζ_{n,k}² with ζ_{n,k} := n^{−1/2} Σ_{t=1}^{n} ϕ_k(Z_t). Using the Cramér-Wold theorem, we will now show that, for any K ∈ N, (ζ_{n,k})_{1≤k≤K} weakly converges to (ζ_k)_{1≤k≤K}, where the ζ_k are centered Gaussian variables with their covariances given in (5).
Let c_1, ..., c_K be real constants and set ξ_t := Σ_{k=1}^{K} c_k ϕ_k(Z_t). Then the ξ_t are centered random variables with E[ξ_t²] = Σ_{k=1}^{K} c_k². Here, we have used the Cauchy-Schwarz inequality and the fact that the eigenfunctions ϕ_k form an orthonormal basis of L²(θ). An application of Jensen's inequality implies ‖ϕ_k‖_{2+ε} ≤ λ_k^{−1} ‖h‖_{2+ε}. Since our kernel h has finite (2 + ε)-moments by assumption, this property translates to the eigenfunctions ϕ_k. Using Theorem 3.7 and Remark 1.8 in [4] therefore gives us

|Cov(ϕ_k(Z_1), ϕ_l(Z_{1+d}))| ≤ C α(d)^{ε/(2+ε)}

for all 1 ≤ k, l ≤ K, where C is a positive constant depending on the corresponding eigenfunctions and eigenvalues. From this and the fact that α(n) = O(n^{−r}) with r > 1 + 2ε^{−1}, it follows that, for any k, l, the series Σ_{d=1}^{∞} Cov(ϕ_k(Z_1), ϕ_l(Z_{1+d})) and lim_n n^{−1} Σ_{d=1}^{n−1} d·Cov(ϕ_k(Z_1), ϕ_l(Z_{1+d})) converge, since d/n < 1 for all 1 ≤ d < n. Thus, with S_n denoting the sum over ξ_1, ..., ξ_n and σ_n² := Var(S_n), the sequence n^{−1} σ_n² converges to a limit σ², where we have made use of the stationarity of the process (Z_k)_{k∈N} and the fact that the eigenfunctions ϕ_k form an orthonormal basis of L². If ζ_1, ..., ζ_K are Gaussian random variables with their covariance function given by (5), the limit σ² is the variance of the linear combination Σ_{k=1}^{K} c_k ζ_k.

We now show the uniform integrability of the sequence (S_n² σ_n^{−2})_{n∈N}. It suffices to show that E|S_n σ_n^{−1}|^{2+δ} is uniformly bounded in n for some δ > 0.
Since h has finite (2 + ε)-moments, we get a uniform bound E|ξ_t|^{2+ε} ≤ M(ε); here, we have made use of (6) and the stationarity of the sequence (Z_n), which ensures that the upper bound M(ε) is indeed uniform in n. Since α(n) = O(n^{−r}) with r > 1 + 2ε^{−1} and σ_n has rate of growth Θ(√n), Theorem 2.1 in [13] gives us E|S_n σ_n^{−1}|^{2+δ} = O(1) for some δ > 0. This implies uniform integrability of (S_n² σ_n^{−2})_{n∈N}. Using Theorem 10.2 from [4] therefore gives us the asymptotic normality of S_n/σ_n, and so, by the Cramér-Wold theorem, the vectors (ζ_{n,k})_{1≤k≤K} converge to Gaussian vectors (ζ_k)_{1≤k≤K} with the covariance structure described in (5) for any K ∈ N. Now, applying the continuous mapping theorem gives us n V^(K) →_d Σ_{k=1}^{K} λ_k ζ_k². The limit Σ_{k=1}^{∞} λ_k ζ_k² is well defined by the summability of the eigenvalues λ_k, which is due to the identity Σ_{k=1}^{∞} λ_k = ∫ h(z, z) dθ(z) < ∞. We will now show that the truncation error vanishes, i.e.

lim_{K→∞} lim sup_{n→∞} E[n |V − V^(K)|] = 0.

We consider the Hilbert space H of all real-valued sequences (a_k)_{k∈N} for which the series Σ_k λ_k a_k² converges, equipped with the inner product ⟨(a_k), (b_k)⟩_H := Σ_k λ_k a_k b_k. Here, we define the covariance of two H-valued random variables X and Y as the real number Cov(X, Y) := E⟨X, Y⟩_H − ⟨EX, EY⟩_H. We aim to employ a covariance inequality for Hilbert-space valued random variables. For this, let us first consider the (2 + ε)-moments of T_K(Z_1), where T_K(z) := (0, ..., 0, ϕ_{K+1}(z), ϕ_{K+2}(z), ...) denotes the tail sequence in H. Since h has finite (1 + ε/2)-moments on the diagonal by assumption, T_K(Z_1) is (2 + ε)-integrable. Lemma 2.2 in [7] and the stationarity of the process (Z_t)_{t∈N} therefore give us a covariance bound of the form

|Cov(T_K(Z_s), T_K(Z_t))| ≤ C ‖T_K(Z_1)‖_{2+ε}² α(|s − t|)^{ε/(2+ε)},

and we have shown before that n^{−1} Σ_{s,t=1}^{n} α(|s − t|)^{ε/(2+ε)} converges to a finite limit c. Putting all of the above together yields the claimed truncation bound and completes the proof.

Lemma 3.
Under the assumptions of Theorem 3, the moments E|f(Z_{i_1}, ..., Z_{i_4})|^p are uniformly bounded over all collections of indices for suitable exponents p < q, where f is the function from the proof of Theorem 1.
Proof. First, consider any two indices i_1, i_2. Then, due to (17), we obtain a bound in terms of d(·, x_0), where x_0 is some arbitrary point in X. Now, let i_1, ..., i_4 be fixed but arbitrary indices. Then, with a similar bound to the one used in Lemma 5, we use Lemma 1 from [16] for the function h(x_1, ..., x_4)^p and the reordered collection (i_2, i_3, i_1, i_4). Their assumptions are satisfied with δ := q/p − 1, because of (10). Thus, Lemma 1 in [16] gives us a bound in terms of β(n), where β(n) is the β-mixing coefficient of the sequence (Z_k)_{k∈N}. Because β(n) ≤ 1 for all n ∈ N, (10), (11) and (12) give us the claimed uniform bound.

The following lemma is an adaptation of Lemma 2 in [16], in the sense that our result is implicitly contained in their proof. Another variant of this lemma (for U-statistics) can be found in [2]. Since both of these lemmas are slightly different from our version, we include a proof for the sake of completeness. However, it should be noted that all three proofs apply the same technique. Here, we understand degeneracy as E h(z_1, ..., z_{c−1}, Z_c) = 0 almost surely.

Lemma 4.
Let h be a degenerate kernel of order c. If, for some p > 2, the p-th moments of h(Z_{i_1}, ..., Z_{i_c}) are uniformly bounded and (Z_n)_{n∈N} is absolutely regular with mixing coefficients β(n) = O(n^{−r}) for sufficiently large r, then E[(Σ_{1≤i_1,...,i_c≤n} h(Z_{i_1}, ..., Z_{i_c}))²] = O(n^c).

Proof. We will follow the basic idea of the proof of Lemma 2 in [16]. First, consider the special case c = 2. Now, due to the degeneracy of our kernel h, we can employ Lemma 1 in [16] to bound the summands with |i_2 − i_1| ≥ |i_4 − i_3|. The sum converges due to our assumptions on β(n). The same bound can be established for the cases where |i_4 − i_3| ≥ |i_2 − i_1|. The only combinations missing are those where (i_1, i_2) = (i_3, i_4), of which there are n². We can combine these results to get the claimed bound, which proves the lemma in the case c = 2.

The proof for arbitrary c follows the same idea. We then obtain an upper bound which again is O(n^c) due to our bounds on β(n).
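The shape of the limit in Theorem 2 can be illustrated in the simplest possible setting (a toy example of our own, with iid data as a degenerate case of mixing): for iid standard normals and the degenerate, positive semidefinite kernel h(x, x′) = x·x′, the integral operator has the single eigenpair λ_1 = 1, ϕ_1(x) = x, and n·V_h = n·x̄² converges in distribution to λ_1 ζ_1² with ζ_1 standard normal, i.e. to a χ²(1) limit with expected value 1:

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 2000, 300
samples = rng.normal(size=(reps, n))

# n * V_h with kernel h(x, x') = x * x' reduces to n * (sample mean)^2
stats = n * samples.mean(axis=1) ** 2

mean_stat = stats.mean()  # close to the chi-squared(1) mean, which is 1
```

Under serial dependence, the ζ_k would no longer be standard, and their covariance (5) would pick up the autocovariances of ϕ_k(Z_t).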

Theorem 3.
Suppose that X and Y are of negative type via mappings φ and ψ, respectively, and that X × Y is σ-compact. If X and Y are independent and have finite (1 + ε)-moments for some ε > 0, and the sequence (Z_k)_{k∈N} is absolutely regular with mixing coefficients β(n) = O(n^{−r}) for some r > 6(1 + 2ε^{−1}), then

n dcov(θ_n) →_d Σ_{k=1}^{∞} λ_k ζ_k²,

where the ζ_k are centered Gaussian random variables whose covariance function, given in (5), is determined by the dependence structure of the sequence (Z_k)_{k∈N}, and the parameters λ_k > 0 are determined by the underlying distribution θ.
Proof. Consider the identity dcov(θ_n) = V_h̄(Z_1, ..., Z_n) =: V as given in Theorem 1. We will employ the Hoeffding decomposition, i.e. we write

V = Σ_{c=0}^{6} (6 choose c) V_{h̄_c},

where h̄_c denotes the c-th Hoeffding projection of h̄ for 0 ≤ c ≤ 6. It can be readily seen that, under the assumption of independence of X and Y, h̄_0 = dcov(θ) = 0 and h̄_1 = 0 almost surely, and so the Hoeffding decomposition reduces to

V = 15 V_{h̄_2} + Σ_{c=3}^{6} (6 choose c) V_{h̄_c}.

We will show that the kernel h̄_2 satisfies the conditions of Theorem 2 and that, under our assumptions,

n Σ_{c=3}^{6} (6 choose c) V_{h̄_c} → 0 in probability.    (14)

Application of some algebra shows that h̄_2 = δ_θ/15, proceeding in the following way: it can be easily checked that, under independence of X and Y, h̄ is a degenerate kernel, since integrating over all but one argument of f (with respect to either of the marginal distributions of θ) yields a function which is 0 almost surely. Therefore, h̄_2 can be computed by averaging over the symmetric group S_6 of all permutations operating on {1, ..., 6}.
Notice that the summands are equal to δ_θ(z_{σ(1)}, z_{σ(2)}) if σ(1), σ(2) ∈ {1, 2}. This follows directly from the definitions of d_µ and d_ν. Moreover, 1 and 2 are the only indices appearing in both f(X_1, ..., X_4) and f(Y_1, Y_2, Y_5, Y_6), so any permutation σ with σ(1), σ(2) ∉ {1, 2} results in taking the integral of f over all or all but one argument, either with respect to µ or with respect to ν. But we have seen before that these integrals are 0 almost surely, and so, due to the independence of X and Y, the same is true for the integral of h with respect to θ.
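The degeneracy used here has a concrete sample analogue in the Euclidean case: after double centering, integrating the empirical kernel d_{µ_n} over one argument gives 0, i.e. every row of the doubly centered distance matrix sums to zero. A small numerical check (our own illustrative code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=300)
D = np.abs(x[:, None] - x[None, :])

# empirical analogue of d_mu: double centering of the distance matrix
A = D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()

# summing a row corresponds to integrating out one argument
max_row_sum = float(np.abs(A.sum(axis=1)).max())
```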
We will now prove (14). For this, we first note that, under our assumptions, the kernel h̄ has finite (2 + ε)-moments with respect to θ^6. This can be seen with a similar approach as in the proof of Lemma 5. Furthermore, Lemma 3 together with the independence of X and Y gives us the existence of an upper bound M ∈ R such that

E|h̄(Z_{i_1}, ..., Z_{i_6})|^{2+ε} ≤ M

for any collection of indices 1 ≤ i_1, ..., i_6 ≤ n.
Employing Lemma 4 therefore gives us E[V_{h̄_c}²] = O(n^{−c}) for all c ≥ 2. Now, together with (13), we have n V_{h̄_c} → 0 in probability for all c ≥ 3. This implies (14), which together with (15) proves the theorem.

Corollary 1. Under the assumptions of Theorem 3, we have

n dcov(θ_n) / (D(µ_n) D(ν_n)) →_d (Σ_{k=1}^{∞} λ_k ζ_k²) / (Σ_{k=1}^{∞} λ_k).

If θ is not the product measure of its marginal distributions µ and ν, the left hand side converges to ∞ almost surely.
Proof. We have the identity D(µ_n) = n^{−2} Σ_{k,l=1}^{n} d(X_k, X_l), and thus, by Lemma 2, D(µ_n) → D(µ) almost surely. The same holds for D(ν_n), and so the convergence in distribution follows from Slutsky's theorem. Since D(µ)D(ν) = E δ_θ(Z_1, Z_1) = Σ_{k=1}^{∞} λ_k, the expected value of the limiting distribution is equal to 1.
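The normalisation can be checked in a quick simulation (an illustrative sketch of our own for real-valued iid samples): under independence, the statistic n·dcov(θ_n)/(D(µ_n)D(ν_n)) fluctuates around an expected value of 1.

```python
import numpy as np

def double_center(D):
    return D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()

rng = np.random.default_rng(0)
reps, n = 200, 200
stats = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=n)
    y = rng.normal(size=n)          # independent of x
    Dx = np.abs(x[:, None] - x[None, :])
    Dy = np.abs(y[:, None] - y[None, :])
    dcov = (double_center(Dx) * double_center(Dy)).mean()
    # D(mu_n) and D(nu_n) are the mean pairwise distances
    stats[i] = n * dcov / (Dx.mean() * Dy.mean())

mean_stat = stats.mean()  # fluctuates around 1 under independence
```

Under dependence the same statistic grows linearly in n, which is what makes it usable as a test of independence.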
Remark. It would be desirable to achieve a result similar to Theorem 3 under the assumption of just α-mixing. For example, Theorem 3.2 in [5] gives such a result under the supposition that X and Y are real-valued random vectors. For our more general setting of (pseudo-)metric spaces, one only needs to show that (14) still holds in the case of α-mixing, since Theorem 2 does not require absolute regularity. We consider it likely that this can indeed be derived from the favourable properties of the distance covariance.

Generalisation to pseudometric spaces
Let (X, d) be a metric space and consider d^β for β ∈ (0, 2]. Then d^β is in general only a pseudometric, i.e. the triangle inequality does not necessarily hold for d^β. We will develop parts of the theory of [10] for pseudometric spaces of this particular kind, which we will refer to as β-pseudometric spaces. This is of interest if one considers dcov_β, a generalisation of the usual distance covariance, which results from using the β-th power of the metrics on X and Y in the definition of d_µ and d_ν. That is, dcov_β with respect to (X, d) and (Y, d) is equivalent to the regular distance covariance with respect to the β-pseudometric spaces (X, d^β) and (Y, d^β). Obviously, for any constant β > 0, d^β induces the same topology (and thus the same Borel σ-algebra) as the original metric d. This means that any β-pseudometric space is a metrizable topological space.
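Computationally, dcov_β only changes the distances entering the double-centering step. A sketch for real-valued samples (illustrative code with our own function names); for β = 2 the statistic collapses to 4·cov(x, y)², which serves as a quick correctness check:

```python
import numpy as np

def empirical_dcov_beta(x, y, beta=1.0):
    # raise the metric to the beta-th power before double centering;
    # beta = 1 recovers the regular distance covariance
    A = np.abs(x[:, None] - x[None, :]) ** beta
    B = np.abs(y[:, None] - y[None, :]) ** beta
    A = A - A.mean(axis=0) - A.mean(axis=1, keepdims=True) + A.mean()
    B = B - B.mean(axis=0) - B.mean(axis=1, keepdims=True) + B.mean()
    return (A * B).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=400)
y = rng.normal(size=400)
```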
This approach of viewing dcov_β not as a different object on the same space, but as the same object on a different space might not be very intuitive at first. However, since the concept of (strong) negative type does not require a metric space, this characterisation allows us to still use the relation between the (strong) negative type of the underlying space and the distance covariance. This leads to the question of whether (X, d^β) is of (strong) negative type, given the original metric space (X, d), for which some criteria are known; see, for example, Corollary 3 or, more generally, [9] and [12].
Note that if β ∈ (0, 1], d^β is indeed still a metric, and we can rely on the already developed theory for separable metric spaces. Thus, we get the following result.

Corollary 2. Let β ∈ (0, 1]. Then Theorems 1 and 3 also hold for dcov_β, with the moment conditions imposed on d^β in place of d.

Proof. Theorem 1 follows immediately. For Theorem 3, we note that d^β induces the same Borel σ-algebra as d. Furthermore, by Remark 3.19 in [10], the resulting metric spaces are still of negative type.

For β ∈ (1, 2), while we cannot rely on the triangle inequality, Jensen's inequality gives us a result which we will call the weak triangle inequality. Specifically, for any β ∈ [1, 2],

d^β(x, x′) ≤ (d(x, x_0) + d(x_0, x′))^β ≤ 2^{β−1} (d^β(x, x_0) + d^β(x_0, x′))

for all x, x′, x_0 ∈ X, where the second step uses the convexity of t ↦ t^β. This can be further bounded by replacing the factor 2^{β−1} by 2.
As in the metric case, we say that a probability measure µ has finite first moment if there exists an element x_0 ∈ X such that ∫ d(x, x_0) dµ(x) < ∞. Again, the choice of x_0 is arbitrary due to the weak triangle inequality. Thus, we can define the objects a_µ, D(µ) and d_µ as in the metric case.

Lemma 5. If µ has finite βp-moment, then d_µ^(β) has finite 2p-moment with respect to µ² and finite p-moment on the diagonal, for any p ≥ 1.
Proof. We take inspiration from the proof of Proposition 2.6 in [10]. Define the functions f_+ and f as there; using the weak triangle inequality, |f_+| ≤ 4 d^β(x_2, x_3). In the same way, one shows that the absolute value of f(x_1, ..., x_4) can also be bounded by 4 d^β(x_1, x_4). Therefore |h(z_1, ..., z_6)| ≤ 16 d^β(x_2, x_3) d^β(y_1, y_4), and the claimed moment bounds follow; in particular, d_µ^(β) has finite p-moment on the diagonal.

We can now define δ_θ and dcov(θ) analogously to the metric case. Since the relevant proofs do not make use of the triangle inequality, it follows from [10] that, for pseudometric spaces of strong negative type, θ = µ ⊗ ν if and only if dcov(θ) = 0. This, together with the next lemma, gives a very easy proof of Theorem 4.2 in [6].

Lemma 6. Let H be a separable Hilbert space. Then (H, ‖·‖^β) is of negative type for all β ∈ (0, 2] and of strong negative type for all β ∈ (0, 2).

Proof. Without loss of generality, assume H to be equal to L²[0, 1]. By Theorem 5 in [12], for any β ∈ (0, 2], there exists an embedding Φ of L²[0, 1] into a Hilbert space such that ‖x − x′‖^β = ‖Φ(x) − Φ(x′)‖² for all x, x′ ∈ H, which implies that (H, ‖·‖^β) is of negative type. By Remark 3.19 in [10] (which, along with all its auxiliary results, also holds for pseudometric spaces), the space (H, ‖·‖^β) therefore has strong negative type for all β ∈ (0, 2).
We can use this lemma to adapt Corollary 5.9 from [9].

Corollary 3. Let (X, d) be a metric space. If there exists an isometric embedding from X into a separable Hilbert space H, then (X, d^β) is of negative type for all β ∈ (0, 2] and of strong negative type for all β ∈ (0, 2).
Proof. Fix β ∈ (0, 2], and let ϕ : X → L²[0, 1] be an isometric embedding. By Lemma 6, (H, ‖·‖_H^β) is of negative type via some embedding Φ, which implies that (X, d^β) is of negative type via Φ ∘ ϕ. If β < 2, then (H, ‖·‖_H^β) is of strong negative type, and so, for any two probability measures µ_1, µ_2 on X with D(µ_1 − µ_2) = 0, we have that D(µ_1^ϕ − µ_2^ϕ) = 0, where µ_i^ϕ denotes the pushforward of µ_i via ϕ. We can extend the last integral to the entire space H, because the pushforward measures vanish on ϕ(X)^C. Using the strong negative type of (H, ‖·‖_H^β), this gives us µ_1^ϕ = µ_2^ϕ, which implies µ_1 = µ_2, since ϕ is injective.

Corollary 4. Let β ∈ (1, 2). Then, if we replace the finite first moment condition of Theorem 1 by a finite β-moment assumption, Theorem 1 still holds for dcov_β. If we furthermore assume X and Y to be isometrically embeddable into separable Hilbert spaces, and replace the finite (1 + ε)-moment condition by a finite (1 + ε)β-moment assumption, then Theorem 3 still holds for dcov_β.

Proof. We first consider Theorem 1. We can replace (4) by |h(z_1, ..., z_6)| ≤ 16 d^β(x_2, x_3) d^β(y_1, y_4), as we have done in the proof of Lemma 5. This changes the original bound only by a constant, which does not affect the remainder of the proof.
If X and Y are isometrically embeddable into separable Hilbert spaces, then by Corollary 3 the spaces resulting from raising their metrics to the power β are of negative type. By Lemma 5, the proof of Theorem 3 still holds for β-pseudometric spaces. We can therefore apply Theorem 3 to the spaces (X , d β ) and (Y, d β ).