The exact equivalence of distance and kernel methods in hypothesis testing

Abstract

Distance correlation and the Hilbert-Schmidt independence criterion are widely used for independence testing, two-sample testing, and many other inference tasks in statistics and machine learning. The two methods are tightly related, yet are treated as distinct entities in the majority of the existing literature. In this paper, we propose a simple and elegant bijection between metric and kernel. The bijective transformation better preserves the similarity structure, ensures that distance correlation and the Hilbert-Schmidt independence criterion always yield the same result for hypothesis testing, streamlines the code base for implementation, and enables the rich literatures on distance-based and kernel-based methodologies to communicate directly with each other.

References

  • Balasubramanian, K., Sriperumbudur, B., Lebanon, G.: Ultrahigh dimensional feature screening via RKHS embeddings. In: Proceedings of Machine Learning Research, pp. 126–134 (2013)

  • Chang, B., Kruger, U., Kustra, R., Zhang, J.: Canonical correlation analysis based on Hilbert-Schmidt independence criterion and centered kernel target alignment. In: International Conference on Machine Learning, pp. 316–324 (2013)

  • Fokianos, K., Pitsillou, M.: Testing independence for multivariate time series via the auto-distance correlation matrix. Biometrika 105(2), 337–352 (2018)

  • Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B.: Kernel measures of conditional dependence. In: Advances in Neural Information Processing Systems (2007)

  • Good, P.: Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer, Berlin (2005)

  • Gretton, A., Gyorfi, L.: Consistent nonparametric tests of independence. J. Mach. Learn. Res. 11, 1391–1423 (2010)

  • Gretton, A., Herbrich, R., Smola, A., Bousquet, O., Schölkopf, B.: Kernel methods for measuring independence. J. Mach. Learn. Res. 6, 2075–2129 (2005)

  • Heller, R., Heller, Y., Gorfine, M.: A consistent multivariate test of association based on ranks of distances. Biometrika 100(2), 503–510 (2013)

  • Heller, R., Heller, Y., Kaufman, S., Brill, B., Gorfine, M.: Consistent distribution-free $k$-sample and independence tests for univariate random variables. J. Mach. Learn. Res. 17(29), 1–54 (2016)

  • Kim, I., Balakrishnan, S., Wasserman, L.: Robust multivariate nonparametric tests via projection-pursuit (2018). arXiv:1803.00715

  • Lee, Y., Shen, C., Priebe, C.E., Vogelstein, J.T.: Network dependence testing via diffusion maps and distance-based correlations. Biometrika 106(4), 857–873 (2019)

  • Li, R., Zhong, W., Zhu, L.: Feature screening via distance correlation learning. J. Am. Stat. Assoc. 107, 1129–1139 (2012)

  • Lyons, R.: Distance covariance in metric spaces. Ann. Probab. 41(5), 3284–3305 (2013)

  • Mehta, R., Chung, J., Shen, C., Ting, X., Vogelstein, J.T.: Independence testing for multivariate time series (2020). arXiv:1908.06486

  • Micchelli, C., Xu, Y., Zhang, H.: Universal kernels. J. Mach. Learn. Res. 7, 2651–2667 (2006)

  • Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Proceedings of the 14th International Conference on Neural Information Processing Systems, pp. 849 – 856 (2001)

  • Pan, W., Wang, X., Xiao, W., Zhu, H.: A generic sure independence screening procedure. J. Am. Stat. Assoc. 114, 928–937 (2018)

  • Panda, S., Shen, C., Priebe, C.E., Vogelstein, J.T.: Multivariate multisample multiway nonparametric manova (2020). arXiv:1910.08883

  • Rizzo, M., Szekely, G.: DISCO analysis: a nonparametric extension of analysis of variance. Ann. Appl. Stat. 4(2), 1034–1055 (2010)

  • Rizzo, M., Szekely, G.: Energy distance. Wiley Interdiscip. Rev. Comput. Stat. 8(1), 27–38 (2016)

  • Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013)

  • Shen, C.: High-dimensional independence testing and maximum marginal correlation (2020). arXiv:2001.01095

  • Shen, C., Priebe, C.E., Vogelstein, J.T.: From distance correlation to multiscale graph correlation. J. Am. Stat. Assoc. 115(529), 280–291 (2020)

  • Shen, C., Vogelstein, J.T.: The chi-square test of distance correlation (2020). arXiv:1912.12150

  • Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)

  • Song, L., Smola, A., Gretton, A., Borgwardt, K., Bedo, J.: Supervised feature selection via dependence estimation. In: ICML ’07 Proceedings of the 24th International Conference on Machine learning, pp. 823–830 (2007)

  • Szekely, G., Rizzo, M.: Hierarchical clustering via joint between-within distances: extending Ward’s minimum variance method. J. Classif. 22, 151–183 (2005)

  • Szekely, G., Rizzo, M.: Brownian distance covariance. Ann. Appl. Stat. 3(4), 1233–1303 (2009)

  • Szekely, G., Rizzo, M.: Partial distance correlation with methods for dissimilarities. Ann. Stat. 42(6), 2382–2412 (2014)

  • Szekely, G., Rizzo, M., Bakirov, N.: Measuring and testing independence by correlation of distances. Ann. Stat. 35(6), 2769–2794 (2007)

  • Vogelstein, J.T., Wang, Q., Bridgeford, E., Priebe, C.E., Maggioni, M., Shen, C.: Discovering and deciphering relationships across disparate data modalities. eLife 8, e41690 (2019)

  • von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)

  • Wang, X., Pan, W., Hu, W., Tian, Y., Zhang, H.: Conditional distance correlation. J. Am. Stat. Assoc. 110(512), 1726–1734 (2015)

  • Wang, S., Shen, C., Badea, A., Priebe, C.E., Vogelstein, J.T.: Signal subgraph estimation via iterative vertex screening (2019). arXiv:1801.07683

  • Xiong, J., Arroyo, J., Shen, C., Vogelstein, J.T.: Graph independence testing: applications in multi-connectomics (2020). arXiv:1906.03661

  • Zhang, Q., Filippi, S., Gretton, A., Sejdinovic, D.: Large-scale kernel methods for independence testing. Stat. Comput. 28(1), 113–130 (2018)

  • Zhou, Z.: Measuring nonlinear dependence in time-series, a distance correlation approach. J. Time Ser. Anal. 33(3), 438–457 (2012)

  • Zhong, W., Zhu, L.: An iterative approach to distance correlation-based sure independence screening. J. Stat. Comput. Simul. 85(11), 2331–2345 (2015)

  • Zhu, L., Xu, K., Li, R., Zhong, W.: Projection correlation between two random vectors. Biometrika 104(4), 829–843 (2017)

Acknowledgements

This work was graciously supported by Microsoft Research, the National Science Foundation Award DMS-1921310, the Defense Advanced Research Projects Agency's (DARPA) SIMPLEX program through SPAWAR Contract N66001-15-C-4041, and the DARPA Lifelong Learning Machines program through Contract FA8650-18-2-7834. The authors thank Dr. Minh Tang, Dr. Carey Priebe, and Dr. Franca Guilherme for their comments and suggestions. We also thank Dr. Arthur Gretton, and acknowledge the anonymous feedback we received, regarding the asymptotic property of the bijection. We further thank the journal editor and reviewers for their valuable suggestions, which significantly improved the organization and exposition of the paper.

Author information

Corresponding author

Correspondence to Cencheng Shen.

Appendix: Proofs

Theorem 1

Given a metric \(d(\cdot ,\cdot )\) and a fixed point z, the bijective induced kernel and the fixed-point induced kernel are related via

$$\begin{aligned} {\hat{k}}_d(x_i,x_j)={\tilde{k}}_d(x_i,x_j)+f(x_i)+f(x_j) \end{aligned}$$

where the shift function is \(f(x_{i})=\max \limits _{s,t \in [n]}(d(x_s,x_t))/2-d(x_{i},z).\)

Given a positive definite and translation invariant kernel \(k(\cdot ,\cdot )\), the bijective induced metric and the fixed-point induced metric are the same, i.e., for any \(x_i, x_j\) we have

$$\begin{aligned} {\hat{d}}_k(x_i,x_j)={\tilde{d}}_k(x_i,x_j). \end{aligned}$$

Proof

The first part from metric to kernel follows easily by algebraic manipulation. To prove the second part where the two induced metrics are the same, it suffices to show that for any sample data \(\{x_i\}\) and any \(i \in [n]\), it holds that

$$\begin{aligned} k(x_{i},x_{i})=\max \limits _{s,t \in [n]}(k(x_s,x_t)) \end{aligned}$$

Note that by translation invariance, \(k(x_{i},x_{i})=k(x_{j},x_{j})\) for any \(i \ne j\), so the diagonal entries \(k(x_{i},x_{i})\) share a common value; it remains to show that this common value is the maximum element.

We prove by contradiction: suppose \(k(x_{i},x_{i})\) is not the maximum element. Then there must exist some i and \(j \ne i\) such that \(k(x_{i},x_{j}) > k(x_{i},x_{i})\). Denote by \({\mathbf {K}}\) the \(n \times n\) kernel matrix of the sample data \(\{x_i\}\), and let a be the \(1 \times n\) vector with 1 at the ith entry, \(-1\) at the jth entry, and zeros elsewhere. Then

$$\begin{aligned} a {\mathbf {K}} a^{T}=k(x_{i},x_{i})+k(x_{j},x_{j})-2k(x_{i},x_{j}) < 0. \end{aligned}$$

This contradicts the kernel being positive definite. Therefore, \(k(x_{i},x_{i})\) must be the maximum element, and the two induced metrics are the same. \(\square\)
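
For concreteness, the following NumPy sketch implements the two induced kernels on a finite sample and checks the relation in Theorem 1 numerically. It is only an illustrative sketch: it assumes the matrix form \(\hat{{\mathbf {K}}}_{{\mathbf {D}}}=\max \limits _{s,t}(d(x_s,x_t))\,{\mathbf {J}}-{\mathbf {D}}\) for the bijective induced kernel (the form \(a_n{\mathbf {J}}-{\mathbf {D}}\) used later in the proof of Theorem 4) and the fixed-point form \({\tilde{k}}_{d}(x_i,x_j)=d(x_i,z)+d(x_j,z)-d(x_i,x_j)\), which is the form consistent with the shift function above; the function names and toy data are our own.

```python
import numpy as np

def bijective_induced_kernel(D):
    """Bijective induced kernel matrix: K_hat = max(D) * J - D."""
    return D.max() - D

def bijective_induced_metric(K):
    """Bijective induced metric matrix: D_hat = max(K) * J - K."""
    return K.max() - K

def fixed_point_induced_kernel(D, z=0):
    """Fixed-point induced kernel (assumed form): d(x_i, z) + d(x_j, z) - d(x_i, x_j)."""
    dz = D[:, z]
    return dz[:, None] + dz[None, :] - D

# Toy sample with Euclidean distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

K_hat = bijective_induced_kernel(D)
K_tilde = fixed_point_induced_kernel(D, z=0)

# Theorem 1: K_hat(i,j) = K_tilde(i,j) + f(i) + f(j), with f(i) = max(D)/2 - d(x_i, z).
f = D.max() / 2 - D[:, 0]
assert np.allclose(K_hat, K_tilde + f[:, None] + f[None, :])

# Composing the two maps recovers the original distance matrix on the sample.
assert np.allclose(bijective_induced_metric(K_hat), D)
```

The final assertion anticipates the bijectivity property in Theorem 2 below: applying the metric-to-kernel and kernel-to-metric maps in sequence recovers the original distance matrix on the sample.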

Theorem 2

The bijective induced kernel satisfies the following properties:

  1. Non-negativity: \({\hat{k}}_{d}(x_i,x_j) \ge 0\) when \(d(x_i,x_j) \ge 0\).

  2. Identity: \({\hat{k}}_{d}(x_i,x_j)=\max \limits _{s,t \in [n]}\{{\hat{k}}_{d}(x_s,x_t)\}\) implies \(x_i=x_j\), if and only if \(d(x_i,x_j)=0\) implies \(x_i=x_j\).

  3. Symmetry: \({\hat{k}}_{d}(x_i,x_j)={\hat{k}}_{d}(x_j,x_i)\) if and only if \(d(x_i,x_j)=d(x_j,x_i)\).

  4. Negative-Type Metric to Positive Definite Kernel: the bijective induced kernel \({\hat{k}}_{d}(\cdot ,\cdot )\) is positive definite if and only if \(d(\cdot ,\cdot )\) is of negative type.

  5. Bijectivity:

     $$\begin{aligned} {\hat{k}}_{{\hat{d}}_{k}}(\cdot ,\cdot )=k(\cdot ,\cdot ). \end{aligned}$$

  6. Rank Preserving:

     $$\begin{aligned} d(x_i,x_s) < d(x_i,x_t) \Rightarrow {\hat{k}}_{d}(x_i,x_s) >{\hat{k}}_{d}(x_i,x_t). \end{aligned}$$

  7. Translation Invariant: when \(d(\cdot ,\cdot )\) is translation invariant, \({\hat{k}}_{d}(\cdot ,\cdot )\) is also translation invariant.

Proof

As all other properties are straightforward to verify, here we only prove property (4). Without loss of generality, we assume the maximum metric or kernel element is always 1, since the property is invariant to constant scaling. Denote \({\mathbf {a}}=[a_{1},\ldots ,a_{n}]\) as a row vector, its mean as \(\frac{1}{n}\sum _{i=1}^{n} a_{i}= {\bar{a}}\), \({\mathbf {1}}\) as the column vector of ones, and \({\mathbf {b}}\) as a zero-mean vector. Note that \({\mathbf {a}}-{\bar{a}} {\mathbf {1}}^{T}\) is always a zero-mean vector. Given sample data \(\{x_i, i=1,\ldots ,n\}\), denote \({\mathbf {D}}\) as the \(n \times n\) distance matrix, \(\hat{{\mathbf {K}}}_{{\mathbf {D}}}\) as the \(n \times n\) bijective induced kernel matrix, and recall \({\mathbf {J}}\) is the matrix of ones.

To prove the only if direction, it suffices to show that when \({\mathbf {a}} \hat{{\mathbf {K}}}_{{\mathbf {D}}}{\mathbf {a}}^{T} \ge 0\) for any vector \({\mathbf {a}}\), it holds that \({\mathbf {b}}{\mathbf {D}}{\mathbf {b}}^{T} \le 0\) for any zero-mean vector \({\mathbf {b}}\). It follows that

$$\begin{aligned} {\mathbf {b}}{\mathbf {D}}{\mathbf {b}}^{T}={\mathbf {b}}({\mathbf {J}}-\hat{{\mathbf {K}}}_{{\mathbf {D}}}){\mathbf {b}}^{T} ={\mathbf {b}}(-\hat{{\mathbf {K}}}_{{\mathbf {D}}}){\mathbf {b}}^{T} \le 0, \end{aligned}$$

where the first equality uses the bijection \({\mathbf {D}}={\mathbf {J}}-\hat{{\mathbf {K}}}_{{\mathbf {D}}}\), the second uses \({\mathbf {b}}{\mathbf {J}}{\mathbf {b}}^{T}=0\), and the final inequality follows from \({\mathbf {a}} \hat{{\mathbf {K}}}_{{\mathbf {D}}}{\mathbf {a}}^{T} \ge 0\) together with the fact that \({\mathbf {b}}\) is a special case of \({\mathbf {a}}\).

To prove the if direction, it suffices to show that when \({\mathbf {b}}{\mathbf {D}}{\mathbf {b}}^{T} \le 0\) for any zero-mean vector \({\mathbf {b}}\), it holds that \({\mathbf {a}}({\mathbf {J}}-{\mathbf {D}}){\mathbf {a}}^{T} \ge 0\) for any vector \({\mathbf {a}}\). This is established by the following:

$$\begin{aligned}&({\mathbf {b}}+ {\bar{a}}{\mathbf {1}}^{T})({\mathbf {J}}-{\mathbf {D}})({\mathbf {b}}+ {\bar{a}}{\mathbf {1}}^{T})^{T} \\&\quad = {\bar{a}}^2{\mathbf {1}}^{T} {\mathbf {J}}{\mathbf {1}}-{\mathbf {b}} {\mathbf {D}}{\mathbf {b}}^{T}- 2{\bar{a}}{\mathbf {1}}^{T} {\mathbf {D}}{\mathbf {b}}^{T} - {\bar{a}}^2{\mathbf {1}}^{T} {\mathbf {D}}{\mathbf {1}}\\&\quad = -{\mathbf {b}} {\mathbf {D}}{\mathbf {b}}^{T}+{\bar{a}}^2{\mathbf {1}}^{T} {\mathbf {J}}{\mathbf {1}}- 2{\bar{a}}{\mathbf {1}}^{T} {\mathbf {D}}{\mathbf {a}}^{T} + {\bar{a}}^2{\mathbf {1}}^{T} {\mathbf {D}}{\mathbf {1}} \\&\quad = -{\mathbf {b}} {\mathbf {D}}{\mathbf {b}}^{T}+n^2 {\bar{a}}^2 - 2{\bar{a}}\sum \limits _{i=1}^{n} \{a_i \sum \limits _{j=1}^{n}d(x_i,x_j)\} + {\bar{a}}^2\sum \limits _{i,j=1}^{n} d(x_i,x_j)\\&\quad \ge -{\mathbf {b}} {\mathbf {D}}{\mathbf {b}}^{T} \ge 0, \end{aligned}$$

where the first equality follows by expanding all terms and eliminating any term containing \({\mathbf {b}}{\mathbf {1}}=0\); the second equality follows by substituting \({\mathbf {b}}={\mathbf {a}}- {\bar{a}}{\mathbf {1}}^{T}\) and expanding the term \(2{\bar{a}}{\mathbf {1}}^{T} {\mathbf {D}}{\mathbf {b}}^{T}\) accordingly; and the third equality rewrites every term except \({\mathbf {b}} {\mathbf {D}}{\mathbf {b}}^{T}\) as an explicit summation. The last line follows because \(-{\mathbf {b}} {\mathbf {D}}{\mathbf {b}}^{T} \ge 0\), and the other three terms on the fourth line sum to a quantity no smaller than 0:

$$\begin{aligned}&n^2 {\bar{a}}^{2} - 2{\bar{a}}\sum \limits _{i=1}^{n} \{a_i \sum \limits _{j=1}^{n}d(x_i,x_j)\} + {\bar{a}}^2\sum \limits _{i,j=1}^{n} d(x_i,x_j) \\&\quad = n^2 {\bar{a}}^{2} - {\bar{a}} \sum \limits _{i=1}^{n} \{(2a_i -{\bar{a}}) \sum \limits _{j=1}^{n}d(x_i,x_j)\}\\&\quad \ge n^2 {\bar{a}}^{2} - n{\bar{a}} \sum \limits _{i=1}^{n} (2a_i-{\bar{a}}) \\&\quad = n^2 {\bar{a}}^2 - n^2 {\bar{a}}^2 =0. \end{aligned}$$

The third line follows by noting that \(d(x_i,x_j) \le 1\), since the maximum metric element is assumed to be 1. \(\square\)
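
As a quick numerical illustration of property (4) (a sanity check on simulated data, not part of the proof), the sketch below uses the Euclidean metric, a standard example of a metric of negative type, builds the bijective induced kernel matrix in the assumed form \(\max (d)\,{\mathbf {J}}-{\mathbf {D}}\), and checks both the negative-type condition on zero-mean vectors and the positive semi-definiteness that the property predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Euclidean metric: negative type

# Bijective induced kernel matrix; the maximum distance plays the role of the
# constant 1 assumed in the proof.
K_hat = D.max() - D

# Negative type: b D b^T <= 0 for every zero-mean vector b, which (since b J b^T = 0)
# is the same as b K_hat b^T >= 0 on zero-mean vectors.
for _ in range(1000):
    b = rng.normal(size=20)
    b -= b.mean()
    assert b @ D @ b <= 1e-8
    assert b @ K_hat @ b >= -1e-8

# Property (4): for a negative-type metric the induced kernel is positive definite,
# so the smallest eigenvalue of K_hat should be non-negative up to numerical error.
print(np.linalg.eigvalsh(K_hat).min())
```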

Theorem 3

Suppose distance covariance uses a given metric \(d(\cdot ,\cdot )\), and the Hilbert-Schmidt independence criterion \({\hat{{ Hsic}}}\) uses the bijective induced kernel \({\hat{k}}_{d}(\cdot ,\cdot ).\)

Given any sample data \(({\mathbf {X}},{\mathbf {Y}})\), it holds that

$$\begin{aligned}&{Dcov}_{n}^{b}({\mathbf {X}},{\mathbf {Y}})= {\hat{{Hsic}}}_{n}^{b}({\mathbf {X}},{\mathbf {Y}}),\\&{Dcov}_{n}({\mathbf {X}},{\mathbf {Y}}) = {\hat{{Hsic}}}_{n}({\mathbf {X}},{\mathbf {Y}}) + O(\frac{1}{n^2}), \end{aligned}$$

where the remainder term \(O(\frac{1}{n^2})\) is invariant to permutation.

Therefore, sample distance covariance and sample Hilbert-Schmidt covariance always yield the same p value under the permutation test.

Proof

It suffices to prove the equivalence of the two covariance terms. Then the permuted statistics using distance covariance and Hilbert-Schmidt covariance are always the same (under the same permutation), leading to the same p value under the permutation test.

Without loss of generality, assume \(\max \limits _{s,t \in [1,n]}(d(x_s,x_t))=\max \limits _{s,t \in [1,n]}(d(y_s,y_t))=1\). Given sample data \({\mathbf {X}}\), we denote \({\mathbf {D}}^{{\mathbf {X}}}\) as the distance matrix, the bijective induced kernel matrix as \(\hat{{\mathbf {K}}}_{{\mathbf {D}}}^{{\mathbf {X}}}\), and similarly for \({\mathbf {Y}}\). The equivalence for the biased covariances follows as:

$$\begin{aligned}&{Dcov}_{n}^{b}({\mathbf {X}},{\mathbf {Y}}) - {\hat{{Hsic}}}_{n}^{b}({\mathbf {X}},{\mathbf {Y}}) \\&\quad = \frac{1}{n^2}\{ trace({\mathbf {D}}^{{\mathbf {X}}}{\mathbf {H}}{\mathbf {D}}^{{\mathbf {Y}}}{\mathbf {H}})- trace(\hat{{\mathbf {K}}}_{{\mathbf {D}}}^{{\mathbf {X}}}{\mathbf {H}}\hat{{\mathbf {K}}}_{{\mathbf {D}}}^{{\mathbf {Y}}}{\mathbf {H}}) \}\\&\quad = \frac{1}{n^2}\{ trace({\mathbf {D}}^{{\mathbf {X}}}{\mathbf {H}}{\mathbf {D}}^{{\mathbf {Y}}}{\mathbf {H}})- trace(({\mathbf {J}}-{\mathbf {D}}^{{\mathbf {X}}}){\mathbf {H}}({\mathbf {J}}-{\mathbf {D}}^{{\mathbf {Y}}}){\mathbf {H}}) \}\\&\quad = \frac{1}{n^2}\{ trace({\mathbf {D}}^{{\mathbf {X}}}{\mathbf {H}}{\mathbf {J}}{\mathbf {H}}+{\mathbf {D}}^{{\mathbf {Y}}}{\mathbf {H}}{\mathbf {J}}{\mathbf {H}}-{\mathbf {J}}{\mathbf {H}}{\mathbf {J}}{\mathbf {H}}) \}\\&\quad = 0, \end{aligned}$$

where the last equality holds because

$$\begin{aligned} {\mathbf {J}}{\mathbf {H}}={\mathbf {J}}({\mathbf {I}}-\frac{1}{n}{\mathbf {J}})={\mathbf {J}}-{\mathbf {J}}=0. \end{aligned}$$
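
Before turning to the unbiased version, note that the biased equality above is easy to verify numerically. The sketch below assumes the trace forms appearing in the display, \({Dcov}_{n}^{b}=\frac{1}{n^2}trace({\mathbf {D}}^{{\mathbf {X}}}{\mathbf {H}}{\mathbf {D}}^{{\mathbf {Y}}}{\mathbf {H}})\) and \({\hat{{Hsic}}}_{n}^{b}=\frac{1}{n^2}trace(\hat{{\mathbf {K}}}_{{\mathbf {D}}}^{{\mathbf {X}}}{\mathbf {H}}\hat{{\mathbf {K}}}_{{\mathbf {D}}}^{{\mathbf {Y}}}{\mathbf {H}})\), with \(\hat{{\mathbf {K}}}_{{\mathbf {D}}}=\max (d)\,{\mathbf {J}}-{\mathbf {D}}\); the normalization of the maximum distance to 1 is not needed here, because the constant terms vanish through \({\mathbf {J}}{\mathbf {H}}=0\). The toy data are our own.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distance matrix of the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, 2))
Y = X[:, :1] ** 2 + rng.normal(scale=0.1, size=(n, 1))   # dependent toy pair

DX, DY = pairwise_dist(X), pairwise_dist(Y)
KX, KY = DX.max() - DX, DY.max() - DY      # bijective induced kernel matrices
H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H = I - J / n

dcov_b = np.trace(DX @ H @ DY @ H) / n**2  # biased sample distance covariance
hsic_b = np.trace(KX @ H @ KY @ H) / n**2  # biased sample Hilbert-Schmidt covariance
assert np.isclose(dcov_b, hsic_b)          # exact equality, as shown above
```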

To prove the unbiased version, we denote by \({\mathbf {C}}^{{\mathbf {X}}}\) and \(\hat{{\mathbf {C}}}^{{\mathbf {X}}}\) the modified matrices corresponding to the distance matrix \({\mathbf {D}}^{{\mathbf {X}}}\) and the induced kernel matrix \(\hat{{\mathbf {K}}}_{{\mathbf {D}}}^{{\mathbf {X}}}\), respectively. As their diagonal entries are always 0, it suffices to analyze the off-diagonal entries of \({\mathbf {C}}^{{\mathbf {X}}}\) and \(\hat{{\mathbf {C}}}^{{\mathbf {X}}}\): for each \(i \ne j\),

$$\begin{aligned} \hat{{\mathbf {C}}}^{{\mathbf {X}}}(i,j)&= \hat{{\mathbf {K}}}^{{\mathbf {X}}}_{{\mathbf {D}}}(i,j)-\frac{1}{n-2}\sum \limits _{t=1}^{n} \hat{{\mathbf {K}}}^{{\mathbf {X}}}_{{\mathbf {D}}}(i,t)\\&-\frac{1}{n-2}\sum \limits _{s=1}^{n} \hat{{\mathbf {K}}}^{{\mathbf {X}}}_{{\mathbf {D}}}(s,j)+\frac{1}{(n-1)(n-2)}\sum \limits _{s,t=1}^{n}\hat{{\mathbf {K}}}^{{\mathbf {X}}}_{{\mathbf {D}}}(s,t) \\&\quad = (1-{\mathbf {D}}^{{\mathbf {X}}}(i,j))-\frac{1}{n-2}\sum \limits _{t=1}^{n} (1-{\mathbf {D}}^{{\mathbf {X}}}(i,t))\\&\qquad -\frac{1}{n-2}\sum \limits _{s=1}^{n} (1-{\mathbf {D}}^{{\mathbf {X}}}(s,j))+\frac{1}{(n-1)(n-2)}\sum \limits _{s,t=1}^{n}(1-{\mathbf {D}}^{{\mathbf {X}}}(s,t)) \\&\quad = -({\mathbf {C}}^{{\mathbf {X}}}(i,j)+\frac{1}{n-1}). \end{aligned}$$

Therefore, the unbiased sample Hilbert-Schmidt covariance satisfies

$$\begin{aligned} {\hat{{Hsic}}}_{n}({\mathbf {X}}, {\mathbf {Y}})&= \frac{1}{n(n-3)}trace(\hat{{\mathbf {C}}}^{{\mathbf {X}}}\hat{{\mathbf {C}}}^{{\mathbf {Y}}}) \\&= \frac{1}{n(n-3)}trace(({\mathbf {C}}^{{\mathbf {X}}}+\frac{{\mathbf {J}}-{\mathbf {I}}}{n-1})({\mathbf {C}}^{{\mathbf {Y}}}+\frac{{\mathbf {J}}-{\mathbf {I}}}{n-1})) \\&= {Dcov}_{n}({\mathbf {X}}, {\mathbf {Y}}) + \frac{trace((n-1){\mathbf {C}}^{{\mathbf {X}}}({\mathbf {J}}-{\mathbf {I}})+(n-1){\mathbf {C}}^{{\mathbf {Y}}}({\mathbf {J}}-{\mathbf {I}})+({\mathbf {J}}-{\mathbf {I}})^2)}{n(n-1)^2(n-3)}\\&= {Dcov}_{n}({\mathbf {X}}, {\mathbf {Y}}) + O(\frac{1}{n^2}) \end{aligned}$$

As the remainder term is invariant to permutation, the p value is always the same between \({\hat{{Hsic}}}_{n}({\mathbf {X}}, {\mathbf {Y}})\) and \({Dcov}_{n}({\mathbf {X}}, {\mathbf {Y}})\). As is evident from the second line above, the two unbiased sample covariances can be made exactly the same by adding \(\frac{1}{n-1}\) to each off-diagonal entry of the modified matrices \(\hat{{\mathbf {C}}}^{{\mathbf {X}}}\) and \(\hat{{\mathbf {C}}}^{{\mathbf {Y}}}\).

Note that one can replicate the equivalence proof between the biased covariances using the fixed-point induced kernel. However, such equivalence does not hold between the unbiased covariances using the fixed-point induced kernel: as the fixed-point terms no longer cancel out, the remainder term depends on the choice of the fixed point and is no longer invariant to permutation. \(\square\)
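
The unbiased equivalence and the resulting p value agreement can likewise be checked numerically. The sketch below is illustrative only: it assumes that the U-centering displayed above (applied to \({\mathbf {D}}\) in the same way as to \(\hat{{\mathbf {K}}}_{{\mathbf {D}}}\)) together with the factor \(\frac{1}{n(n-3)}\) defines the unbiased sample covariances, normalizes the maximum distance to 1 as in the proof, and uses a generic permutation test that is not code from the paper.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distance matrix of the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def u_center(A):
    """Modified (U-centered) matrix: zero diagonal, off-diagonal entries as in the display."""
    n = A.shape[0]
    C = (A
         - A.sum(axis=1, keepdims=True) / (n - 2)
         - A.sum(axis=0, keepdims=True) / (n - 2)
         + A.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(C, 0.0)
    return C

def unbiased_stat(A, B):
    """Unbiased sample covariance: trace(u_center(A) @ u_center(B)) / (n(n-3))."""
    n = A.shape[0]
    return np.trace(u_center(A) @ u_center(B)) / (n * (n - 3))

def perm_pvalue(A, B, n_perm=200, seed=0):
    """Permutation p value, permuting the second sample's rows and columns."""
    rng = np.random.default_rng(seed)
    obs, count = unbiased_stat(A, B), 0
    for _ in range(n_perm):
        p = rng.permutation(B.shape[0])
        count += unbiased_stat(A, B[np.ix_(p, p)]) >= obs
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 2))
Y = np.sin(X[:, :1]) + rng.normal(scale=0.2, size=(n, 1))

DX = pairwise_dist(X); DX /= DX.max()      # normalize the maximum distance to 1
DY = pairwise_dist(Y); DY /= DY.max()
KX, KY = 1.0 - DX, 1.0 - DY                # bijective induced kernels under max = 1

# Off-diagonal relation between the two modified matrices, as derived above.
off = ~np.eye(n, dtype=bool)
assert np.allclose(u_center(KX)[off], -(u_center(DX)[off] + 1.0 / (n - 1)))

# The two unbiased statistics differ only by the permutation-invariant O(1/n^2) remainder
# (with max distance 1 it works out to 1/((n-1)(n-3))), so the permutation p values coincide.
assert np.isclose(unbiased_stat(KX, KY) - unbiased_stat(DX, DY), 1 / ((n - 1) * (n - 3)))
assert perm_pvalue(DX, DY) == perm_pvalue(KX, KY)
```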

Theorem 4

Given a random variable pair \((X,Y)\) and a metric \(d(\cdot ,\cdot )\) for which the population distance covariance is well-defined as in Equation 2, we define, for the bijective induced kernel \({\hat{k}}_{d}(\cdot ,\cdot )\), the population Hilbert-Schmidt covariance as the limit of the sample covariance:

$$\begin{aligned} {\hat{{Hsic}}}(X,Y)=\lim _{n \rightarrow \infty }{\hat{{Hsic}}}_{n}({\mathbf {X}},{\mathbf {Y}}), \end{aligned}$$

Then the population Hilbert-Schmidt covariance equals the population distance covariance, i.e.,

$$\begin{aligned}&{\hat{{Hsic}}}(X,Y)={Dcov}(X,Y)\\&=E[d(X,X^{\prime })d(Y,Y^{\prime })]+E[d(X,X^{\prime })]E[d(Y,Y^{\prime })]-2E[d(X,X^{\prime })d(Y,Y^{\prime \prime })]. \end{aligned}$$

Proof

Denote

$$\begin{aligned} \max \limits _{s,t \in [1,n]}(d(x_s,x_t))&=a_n \\ \max \limits _{s,t \in [1,n]}(d(y_s,y_t))&=b_n. \end{aligned}$$

Based on the definition of \({\hat{{Hsic}}}(X,Y)\) for the induced kernel and the sample equivalence established above, it follows that

$$\begin{aligned}&{\hat{{Hsic}}}(X,Y) - {Dcov}(X,Y)\\&\quad = \lim _{n \rightarrow \infty }\{{\hat{{Hsic}}}_{n}({\mathbf {X}},{\mathbf {Y}}) -{Dcov}_{n}({\mathbf {X}},{\mathbf {Y}})\} \\&\quad = \lim _{n \rightarrow \infty } \frac{1}{n^2} \{trace(\hat{{\mathbf {K}}}^{{\mathbf {X}}}_{{\mathbf {D}}}{\mathbf {H}}\hat{{\mathbf {K}}}^{{\mathbf {Y}}}_{{\mathbf {D}}}{\mathbf {H}})-trace({\mathbf {D}}^{{\mathbf {X}}}{\mathbf {H}} {\mathbf {D}}^{{\mathbf {Y}}}{\mathbf {H}})\} \\&\quad = \lim _{n \rightarrow \infty } \frac{1}{n^2} \{trace((a_n {\mathbf {J}}- {\mathbf {D}}^{{\mathbf {X}}}){\mathbf {H}}(b_n {\mathbf {J}}- {\mathbf {D}}^{{\mathbf {Y}}}){\mathbf {H}} - {\mathbf {D}}^{{\mathbf {X}}}{\mathbf {H}} {\mathbf {D}}^{{\mathbf {Y}}}{\mathbf {H}})\} \\&\quad = \lim _{n \rightarrow \infty } \frac{1}{n^2} \{trace(a_n b_n {\mathbf {J}}{\mathbf {H}}{\mathbf {J}}{\mathbf {H}}- b_n{\mathbf {D}}^{{\mathbf {X}}}{\mathbf {H}}{\mathbf {J}}{\mathbf {H}}-a_n{\mathbf {J}}{\mathbf {H}}{\mathbf {D}}^{{\mathbf {Y}}}{\mathbf {H}})\}\\&\quad = 0. \end{aligned}$$

Namely, the maximum elements always cancel out, so the limiting equivalence is unaffected even if the maximum increases to infinity. Therefore, it holds that

$$\begin{aligned}&{\hat{{Hsic}}}(X,Y) = {Dcov}(X,Y)\\&\quad = E[d(X,X^{\prime })d(Y,Y^{\prime })]+E[d(X,X^{\prime })]E[d(Y,Y^{\prime })]-2E[d(X,X^{\prime })d(Y,Y^{\prime \prime })], \end{aligned}$$

and the population Hilbert-Schmidt covariance using the induced kernel is always well-defined and equals the population distance covariance. \(\square\)
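
A small numerical companion to this limit argument (illustrative only, with toy data and the same unnormalized matrix form \(a_n{\mathbf {J}}-{\mathbf {D}}\) as in the display): the gap between the two unbiased sample statistics is driven by the maximum elements \(a_n b_n\) and shrinks on the order of \(1/n^2\), so it vanishes in the limit.

```python
import numpy as np

def pairwise_dist(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def u_center(A):
    n = A.shape[0]
    C = (A - A.sum(axis=1, keepdims=True) / (n - 2)
           - A.sum(axis=0, keepdims=True) / (n - 2)
           + A.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(C, 0.0)
    return C

def unbiased_stat(A, B):
    n = A.shape[0]
    return np.trace(u_center(A) @ u_center(B)) / (n * (n - 3))

rng = np.random.default_rng(4)
for n in (20, 40, 80, 160, 320):
    X = rng.normal(size=(n, 2))
    Y = X + rng.normal(scale=0.5, size=(n, 2))
    DX, DY = pairwise_dist(X), pairwise_dist(Y)
    KX, KY = DX.max() - DX, DY.max() - DY   # unnormalized induced kernels a_n J - D, b_n J - D
    gap = unbiased_stat(KX, KY) - unbiased_stat(DX, DY)
    print(n, gap)   # shrinks on the order of a_n * b_n / n^2, vanishing in the limit
```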

Theorem 5

When the given metric \(d(\cdot ,\cdot )\) is of strong negative type, the bijective induced kernel \({\hat{k}}_{d}(\cdot ,\cdot )\) is asymptotically a characteristic kernel. When the given kernel \(k(\cdot ,\cdot )\) is characteristic, the bijective induced metric \({\hat{d}}_{k}(\cdot ,\cdot )\) is asymptotically of strong negative type.

Proof

Given any characteristic kernel \(k(\cdot ,\cdot )\), the sample Hsic converges to 0 if and only if X and Y are independent. By Theorem 4, the distance covariance using the bijective induced metric \({\hat{d}}_{k}(\cdot ,\cdot )\) is exactly the same, so it also converges to 0 if and only if X and Y are independent. Thus the induced metric must be asymptotically of strong negative type by Lyons (2013). Conversely, when \(d(\cdot ,\cdot )\) is of strong negative type, \({\hat{k}}_{d}(\cdot ,\cdot )\) must be asymptotically characteristic by the same argument. \(\square\)

Cite this article

Shen, C., Vogelstein, J.T. The exact equivalence of distance and kernel methods in hypothesis testing. AStA Adv Stat Anal 105, 385–403 (2021). https://doi.org/10.1007/s10182-020-00378-1
