Large-Scale Kernel Methods for Independence Testing

Representations of probability measures in reproducing kernel Hilbert spaces provide a flexible framework for fully nonparametric hypothesis tests of independence, which can capture any type of departure from independence, including nonlinear associations and multivariate interactions. However, these approaches come with an at least quadratic computational cost in the number of observations, which can be prohibitive in many applications. Arguably, it is exactly in such large-scale datasets that capturing any type of dependence is of interest, so striking a favourable tradeoff between computational efficiency and test performance for kernel independence tests would have a direct impact on their applicability in practice. In this contribution, we provide an extensive study of the use of large-scale kernel approximations in the context of independence testing, contrasting block-based, Nystrom and random Fourier feature approaches. Through a variety of synthetic data experiments, it is demonstrated that our novel large scale methods give comparable performance with existing methods whilst using significantly less computation time and memory.


Introduction
Given a paired sample z = {(x i , y i )} m i=1 with each (x i , y i ) ∈ X × Y independently and identically following the joint distribution P XY on some generic domains X and Y, the nonparametric independence problem consists of testing if we should reject the null hypothesis H 0 : P XY = P X P Y in favour of the general alternative hypothesis H 1 : P XY = P X P Y , where P X and P Y are the marginal distributions for X and Y respectively.This problem is fundamental and extensively studied, with wide-ranging applications in statistical inference and modelling.Classical dependence measures, such as Pearson's productmoment correlation coefficient, Spearman's ρ, Kendall's τ or methods based on contingency tables are typically designed to capture only particular forms of dependence (e.g.linear or monotone).Furthermore, they are applicable only to scalar random variables or require space partitioning limiting their use to relatively low dimensions.As availability of larger datasets also facilitates building more complex models, dependence measures are sought that capture more complex dependence patterns and those that occur between multivariate and possibly high-dimensional datasets.In this light, among the most popular dependence measures recently have been those based on characteristic functions [44,43] as well as a broader framework based on kernel methods [15,17].A desirable property of consistency against any alternative -i.e.test power provably increasing to one with the sample size regardless of the form of dependence, is warranted for statistical tests based on such approaches.However, this is achieved at an expense of computational and memory requirements that increase at least quadratically with the sample size, which is prohibitive for many modern applications.Thus, a natural question is whether a favourable tradeoff between computational efficiency and test power can be sought with appropriate large-scale approximations.As we demonstrate, several large-scale approximations are available in this context and they lead to strong improvements in power-per-computatonal unit performance, resulting in a fast and flexible independence testing framework responsive to all forms of dependence and applicable to large datasets.
The key quantity we consider is the Hilbert-Schmidt Independence Criterion (HSIC) introduced by Gretton et al [15].HSIC uses the distance between the kernel embeddings of probability measures in the Reproducing Kernel Hilbert Space (RKHS) [17,49,36].By building on decades of research into kernel methods for machine learning [31], HSIC can be applied to multivariate observations as well as to those lying in non-Euclidean and structured domains, e.g., [17] considers independence testing on text data.HSIC has also been applied to clustering and learning taxonomies [39,6], feature selection [38], causal inference [27,12,47] and computational linguistics [26].A closely related dependence coefficient that measures all types of dependence between two random vectors of arbitrary dimensions is the distance covariance (dCov) of [44,43], which measures distances between empirical characteristic functions or equivalently measures covariances with respect to a stochastic process [43], and its normalised counterpart, distance correlation (dCor).RKHS-based dependence measures like HSIC are in fact extensions of dCov - [33] shows that dCov can be understood as a form of HSIC with a particular choice of kernel.Moreover, dCor can be viewed as an instance of kernel matrix alignment of [10].As we will see, statistical tests based on estimation of HSIC and dCov are computationally expensive and require at least O(m 2 ) time and storage complexity, where m is the number of observations, just to compute an HSIC estimator which serves as a test statistic.In addition, the complicated form of the asymptotic null distribution of the test statistics necessitates either permutation testing [2] (further increasing the computational cost) or even more costly direct sampling from the null distribution, requiring eigendecompositions of kernel matrices using the spectral test of [16], with a cost of O(m 3 ).These memory and time requirements often make the HSIC-based tests infeasible for practitioners.
In this paper, we consider several ways to speed up the computation in HSIC-based tests.More specifically, we introduce three novel fast estimators of HSIC: the block-based estimator, the Nyström estimator and the random Fourier feature (RFF) estimator and study the resulting independence tests.In the block-based setting, we obtain a simpler asymptotic null distribution as a consequence of the Central Limit Theorem in which only asymptotic variance needs to be estimated -we discuss possible approaches for this.RFF and Nyström estimators correspond to the primal finite-dimensional approximations of the kernel functions and as such also warrant estimation of the null distribution in linear time -we introduce spectral tests based on eigendecompositions of primal covariance matrices, which avoid permutation approach and significantly reduce the computational expense for the direct sampling from the null distribution.

Related Work
Some of the approximation methods considered in this paper were inspired by their use in a related context of two-sample testing.In particular, the blockbased approach for two-sample testing was studied in [14,18,48] under the name of linear-time MMD (Maximum Mean Discrepancy), i.e. the distance between the mean embeddings of the probability distributions in the RKHS.The approach estimates MMD on a small block of data and then averages the estimates over blocks to obtain the final test statistic.Our block-based estimator of HSIC follows exactly the same strategy.On the other hand, The Nyström method [46,37] is a classical low-rank kernel approximation technique, where data is projected into lower-dimensional subspaces of RKHS (spanned by so called inducing variables).Such an idea is popular in fitting sparse approximations to Gaussian process (GP) regression models, allowing reduction of the computational cost from O(m 3 ) to O(n 2 m) where n m is the number of inducing variables.To the best of our knowledge, Nyström approximation was not studied in the context of hypothesis testing.Random Fourier feature (RFF) approximations [28], however, due to their relationship with evaluations of empirical characteristic functions, do have a rich history in the context of statistical testing -as discussed in [9], which also proposes an approach to scale up kernel-based two sample tests by additional smoothing of characteristic functions, thereby improving the test power and its theoretical properties.Moreover, the approximation strategy of MMD and two-sample testing through primal representation using RFF have also been studied in [50,42,21].In addition, [22] first proposed the idea of applying RFF in order to construct an approximation to a kernel-based dependence measure.More specifically, they develop Randomised Canonical Correlation Analysis (RCCA) (see also [23,21]) approximating the non-linear kernel-based generalisation of the Canonical Correlation Analysis [20,4] and using a further copula transformation, construct a test statistic termed RDC (randomised dependence coefficient) requiring O(m log m) time to compute.Under suitable assumptions, Bartlett's approximation [25] provides a closed form expression for the asymptotic null distribution of this statistic which further results in a distribution free test, leading to an attractive option for large-scale independence testing.We extend these ideas based on RFF to construct approximations of HSIC and dCov/dCor, which are conceptually distinct kernel-based dependence measures from that of kernel CCA, i.e., they measure different types of norms of RKHS operators (operator norm vs Hilbert-Schmidt norm).
In fact, the Nyström and RFF approximations can also be viewed through the lense of nonlinear canonical analysis framework introduced by [11].This is the earliest example we know where nonlinear dependence measures based on spectra of appropriate Hilbert space operators are studied.In particular, the cross-correlation operator with respect to a dictionary of basis functions in L 2 (e.g.B-splines) is considered in [11].[19] links this framework to the RKHS perspective.The functions of the spectra that were considered in [11] are very general, but the simplest one (sum of the squared singular values) can be recast as the NOrmalised Cross-Covariance Operator (NOCCO) of [13], which considers the Hilbert-Schmidt norm of the cross-correlation operator on RKHSs and as such extends kernel CCA to consider the entire spectrum.While in this work we focus on HSIC (Hilbert-Schmidt norm of the crosscovariance operator ), which is arguably the most popular kernel dependence measure in the literature, a similar Nyström or RFF approximation can be applied to NOCCO as well -we leave this as a topic for future work.
The paper is structured as follows: in Section 2, we first provide some necessary definitions from the RKHS theory and review the aforementioned Hilbert-Schmidt Independence Criterion (HSIC) and discuss its biased and unbiased quadratic time estimators.Then, Section 2.3 gives the asymptotic null distributions of estimators (proofs provided in Section A).In Section 3, we develop a block-based HSIC estimator and derive its asymptotic null distribution.Following this, a linear time asymptotic variance estimation approach is proposed.In Section 4.1 and 4.2, we propose Nyström HSIC and RFF HSIC estimator respectively, both with the corresponding linear time null distribution estimation approaches.Finally, in Section 5, we explore the performance of the three testing approaches on a variety of challenging synthetic data.

Background
This section starts with a brief overview of the key concepts and notation required to understand the RKHS theory and kernel embeddings of probability distributions into the RKHS.It then provides the definition of HSIC which will serve as a basis for later independence tests.We review the quadratic time biased and unbiased estimators of HSIC as well as their respective asymptotic null distributions.As the final part of this section, we outline the construction of independence tests in quadratic time.

RKHS and Embeddings of Measures
Let Z be any topological space on which Borel measures can be defined.By M(Z) we denote the set of all finite signed Borel measures on Z and by M 1 + (Z) the set of all Borel probability measures on Z.We will now review the basic concepts of RKHS and kernel embeddings of probability measures.For further details, see [5,41,40].
Definition 1 Let H be a Hilbert space of real-valued function defined on Z.
If H has a reproducing kernel, it is called a Reproducing Kernel Hilbert Space (RKHS).
As a direct consequence, for any x, y ∈ Z, In machine learning literature, a notion of kernel is understood as an inner product between feature maps [41].By (1), every reproducing kernel is a kernel in this sense, corresponding to a canonical feature map x → k(•, x).For x, y ∈ R p , some examples of reproducing kernels are -Fractional Brownian motion covariance kernel with parameter H ∈ (0, 1): Checking whether a given function k is a valid reproducing kernel can be onerous.Fortunately, the Moore-Aronszajn theorem [3] gives a simple characterisation: for any symmetric, positive definite function k : Z × Z → R, there exists a unique Hilbert space of functions H defined on Z such that k is reproducing kernel of H [5]. RKHS are precisely the space of functions where norm convergence implies pointwise convergence and are as a consequence relatively well behaved comparing to other Hilbert spaces.In nonparametric testing, as we consider here, a particularly useful setup will be representing probability distributions and, more broadly, finite signed Borel measures ν ∈ M(Z) with elements of an RKHS [36].
Definition 2 Let k be a kernel on Z, and ν ∈ M(Z).The kernel embedding of measure ν into the RKHS It is understood from this definition that the integral of any RKHS function f under the measure ν can be evaluated as the inner product between f and the kernel embedding µ k (ν) in the RKHS H k .As an alternative, the kernel embedding can be defined through the use of Bochner integral µ k (ν) = k(•, z)dν(z).Any probability measure is mapped to the corresponding expectation of the canonical feature map.By Cauchy-Schwarz inequality and the Riesz representation theorem, a sufficient condition for the existence of an embedding of ν is that ν ∈ M 1/2 k (Z), where we adopt notation from [33]: which is, e.g.satisfied for all finite measures if k is a bounded function (such as Gaussian kernel).
Embeddings allow measuring distances between probability measures, giving rise to the notion of Maximum Mean Discrepancy (MMD) [7,14].
Definition 3 Let k be a kernel on Z.The squared distance between the kernel embeddings of two probability measures P and Q in the RKHS, MMD k (P, H k is called Maximum Mean Discrepancy (MMD) between P and Q with respect to k.
When the corresponding kernels are characteristic [40], embedding is injective and MMD is a metric on probability measures.The estimators of MMD are useful statistics in nonparametric two-sample testing [14], i.e. testing if two given samples are drawn from the same probability distribution.For any kernels k X and k Y on the respective domains X and Y, it is easy to check that k is a valid kernel on the product domain which can be viewed as the space of Hilbert-Schmidt operators between H k Y and H k X (Lemma 4.6 of [41]).
We are now ready to define an RKHS-based measure of dependence between random variables X and Y .
Definition 4 Let X and Y be random variables on domains X and Y (nonempty topological spaces).Let k X and k Y be kernels on X and Y respectively.
and Y is MMD between the joint measure P XY and the product of marginals HSIC is well defined whenever [33].The name of HSIC comes from the operator view of the RKHS H k X ⊗k Y .Namely, the difference between embeddings [15,17].HSIC is then simply the squared Hilbert-Schmidt norm C XY 2 HS of this operator, while distance correlation (dCor) of [44,43] can be cast as In the sequel, we will suppress dependence on kernels k X and k Y in notation Ξ k X ,k Y (X, Y ) where there is no ambiguity.

Repeated application of the reproducing property gives the following equivalent representation of HSIC [36]:
Proposition 1 The HSIC of X and Y can be written as:

Estimation of HSIC
Using the form of HSIC in (4), given an iid sample of z = {(x i , y i )} m i=1 from the joint distribution P XY , an unbiased estimator of HSIC can be obtained as a sum of three U-statistics [17]: where the index set i m r denotes the set of all r-tuples drawn without replacement from {1, ..., m}, (K x ) ij := k X (x i , x j ) and (K y ) ij := k Y (y i , y j ).Naïve computation of ( 5) would require O(m 4 ) operations.However, an equivalent form which needs O(m 2 ) operations is given in [38] as where Kx = K x − diag(K x ) (i.e. the kernel matrix with diagonal elements set to zero) and similarly for Ky . 1 is a vector of 1s of relevant dimension.
We will refer to the above as the quadratic time estimator.[17] note that the V -statistic estimator (or the quadratic time biased estimator) of HSIC can be an easier-to-use alternative for the purposes of independence testing, since the bias is accounted for in the asymptotic null distribution.The V -statistic is given by where the summation indices are now drawn with replacement.Further, it can be simplified as follows to reduce the computation: where T is an m × m centering matrix.(7) gives an intuitive understanding of the HSIC statistic: it measures average similarity between the centered kernel matrices, which are in turn similarity patterns within the samples. 1

Asymptotic Null Distribution of Estimators
The asymptotic null distribution of the biased HSIC statistic defined in (7) computed using a given data set converges in distribution in Theorem 1 below.This asymptotic distribution builds the theoretical foundation for the spectral testing approach (described in Section 2.4.2) that we will use throughout the paper.
Theorem 1 (Asymptotic Null Distribution of the Biased HSIC) Under the null hypothesis, let the dataset z where N i,j i.i.d.
∼ N (0,1), ∀ i, j ∈ N and {λ i } ∞ i=1 , {η j } ∞ j=1 are the eigenvalues of the integral kernel operators S kPx and S kPy , where the integral kernel operator where kP (z, z ) is the kernel centred at probability measure P : with W, W i.i.d.

∼ P.
For completeness, the proof of this theorem, which is a consequence of [24, Theorem 2.7] and the equivalence between distance-based and RKHS-based independence statistics [33] is given in the Appendix A. As remarked by [33], the finite marginal moment conditions imply that the integral operators S kX and S kY are trace class and hence Hilbert-Schmidt [29].Anderson et al. noted that the form of the asymptotic distribution of the V statistics requires the integral operators being trace class but that of the U statistics only requires them being Hilbert-Schmidt [1,33].Using the same notation as in the case of the V-statistic, the asymptotic distribution of the U-statistic in (5) can be written as: under the null hypothesis.We note that [8] (Lemma 2 and Theorem 1) proves a more general result, applicable to dependent observations under certain mixing conditions where the i.i.d.setting is a special case.Moreover, [30] (Theorem 5 and 6) provides another elegant proof in the context of three-variable interaction testing from [32].However, both [8] and [30] assume boundedness of k X and k Y , while our proof in the Appendix A assumes a weaker condition of finite second moments for both k X and k Y , thus making the result applicable to unbounded kernels such as the Brownian motion covariance kernel.
Under the alternative hypothesis that P X P Y = P XY , [17] remarked that mΞ b,k X ,k Y (Z) converges to HSIC with the corresponding appropriately centred and scaled Gaussian distribution as m → ∞: where the variance σ 2 u = 16(E zi (E zj ,zq,zr (h ijqr )) 2 −HSIC) and h ijqr is defined as with all ordered quadruples (t, u, v, w) drawn without replacement from (i, j, q, r) and assuming E(h 2 ) < ∞.In fact, under the alternative hypothesis, the difference between mΞ b (Z) (i.e. the V-statistic) and the U-statistic drops as 1/m and hence asymptotically the two statistics converges to the same distribution [17].

Quadratic Time Null Distribution Estimations
We would like to design independence tests with an asymptotic Type I error of α and hence we need an estimate of the (1−α) quantile of the null distribution.
Here, we consider two frequently used approaches, namely the permutation approach and the spectral approach, that require at least quadratic time both in terms of memory and computation time.The biased V-statistic will be used because of its neat and compact formulation.

Permutation Approach
Consider an iid sample z = {(x i , y i )} m i=1 with chosen kernels k X and k Y respectively, the permutation/bootstrap approach [2] proceed in the following manner.Suppose the total number of shuffles is fixed at N p , we first compute Ξ k X ,k Y (z) using z, k X and k Y .Then, for each shuffle, we fix the {x i } m i=1 and randomly permute the The one-sided p-value in this instance is the proportion of HSIC values computed on the permuted data that is greater than or equal to The computational time is O(number of shuffles ×m 2 ) for this approach, where the number of shuffles determines the extend to which we have explored the sampling distribution.In other words, a small number of shuffles means that we may only obtained realisations from the mode of the distribution and hence the tail structure is not adequately captured.Although a larger number of shuffles ensures the proper exploration of the sampling distribution, the computation cost can be high.

Spectral Approach
Gretton et al. has shown (Theorem 1 [16]) that the empirical finite sample estimate of the null distribution converges in distribution to its population counterpart provided that the eigenspectrums {γ r } ∞ r=1 of the integral operator Note, the integral operator S k is the tensor product of the operators S kX and S kX : and the eigenvalues of this operator is hence the product of the eigenvalues of these two operators.
The spectral approach [16,49] requires that we first calculate the centred Gram matrices K X = HK X H and K Y = HK Y H for the chosen kernel k X and k Y .Then, we compute the mΞ b,k X ,k Y (z) statistics according to (7).Next, the spectrums (i.e.eigenvalues) {λ i } m i=1 and {η i } m i=1 of K X and K Y are respectively calculated.The empirical null distribution can be simulated by simulating a large enough i.i.d samples from the standard Normal distribution [49] and then generate the test statistic according to (8).Finally, the p-value is computed by calculating the proportion of simulated samples that are greater than or equal to the observed mΞ b,k X ,k Y (z) value.
Additionally, [49] has provided an approximation to the null distribution with a two-parameter Gamma distribution.Despite the computational advantage of such an approach, the permutation and spectral approaches are still preferred since there is no consistency guarantee for the Gamma distribution approach.

Block-based HSIC
The quadratic time test statistics are prohibitive for large dataset as it requires O(m2 ) time in terms of storage and computation.Furthermore, one requires an approximation of the asymptotic null distribution in order to compute the p-value.As we discussed in the previous section, this is usually done by randomly permute the Y observations (i.e. the permutation approach) or by performing an eigen-decomposition of the centred kernel matrices for X and Y (i.e. the spectral approach).Both approaches are expensive in terms of memory and can be computationally infeasible.In this section, we propose a block-based estimator of HSIC which reduce the computational time to linear in the number of samples.The asymptotic null distribution of this estimator will be shown to have a simple form as a result of the Central Limit Theorem (CLT).

The Block HSIC Statistic
Let us consider that the sample is split into blocks of size B m: (where we assumed for simplicity that m is divisible by B).We follow the approach from [48,34] and extend it to independence testing.We compute the unbiased HSIC statistics (Eq.5) on each block b ∈ {1, ..., m B }: and average them over blocks to establish the block-based estimator for HSIC:

Null Distribution of Block-Based HSIC
For the block HSIC statistic, the asymptotic null distribution is a consequence of the Central Limit Theorem (CLT) under the regime where m → ∞, B → ∞ and m B → ∞ 2 .First of all, note that the linear time test statistic ηk is an average of block-based statistics ηb for b ∈ {1, ..., m B } which are independent and identically distributed.Secondly, we recall that E(η b ) = 0 for ηb is an unbiased estimator of HSIC.Finally, where the variance σ 2 k,0 is the variance of the null distributions in Expression (8) and (11) i.e. the variance of W and it is given by 3.3 Linear Time Null Distribution Estimation Expression ( 16) guarantees the Gaussianity of the null distribution of the block-based statistic and henceforth makes the computation of p-value straightforward.We simply return the test statistic and compare against the corresponding quantile of N (0, 1) which is the approach taken in [18,48,34].Note that the resulting null distribution is actually a t-distribution but with a very large number of degrees of freedom, which can be treated as a Gaussian distribution.
The difficulty of estimating the null distribution lies in estimating σ 2 k,0 .We suggest two ways to estimate such variance [34]: within-block permutation and within-block direct estimation.These two approaches are at most quadratic in B within each block which means that the computational cost of estimating the variance is of the same order as that of computing the statistic itself.
Within-block permutation can be done as follows.Within each block, we compute the test statistic using (14).At the same time, we track in parallel a sequence η * b obtained using the same formula but with {y i } m i=1 underwent a permutation.The former is used to calculate the overall block statistics and the latter is used to estimate the null variance σ2 as the independence between the samples holds by construction.
Within block direct estimation can be achieved by using (18) and the corresponding unbiased estimates of E XX ( k2 P X (X, X )) and E Y Y ( k2 P Y (Y, Y )) which can be calculated as follows.For X, the estimate of the variance for each block is given by [38]: to obtain an unbiased estimate for E XX ( k2 P X (X, X )).Similarly, replacing all x with y, we obtain an unbiased estimate for E Y Y ( k2 P Y (Y, Y )).The estimate of the variance is therefore: As remarked in [34], we note that under the null hypothesis, the approach undertaken by [48] is to estimate the null variance directly with the empirical variance of {η b } m/B b=1 .As the null variance is consistently estimated under the null hypothesis, this ensures the correct level of Type I error.However, without using the variance of the "bootstrap samples", such an estimate of the null variance will systematically overestimate as B grows with m.Hence, it will result in a reduced statistical power due to inflated p-values.
Regarding the choice of B, [48] discussed that the null distribution is close to that guaranteed by the CLT when B is small and hence the Type I error will be closer to the desired level.However, the disadvantage is the small statistical power for each given sample size.Conversely, [48] pointed out that a larger B results in a lower variance empirical null distribution and hence higher power.Hence, they suggested a sensible family of heuristics is to set B = m γ for some 0 < γ < 1.As a result, the complexity of the block-based test is O(Bm) = O(m 1+γ ).

Approximate HSIC through Primal Representations
Having discussed how we can construct a linear time HSIC test by processing the dataset in blocks, we now move on to consider how the scaling up can be done through low rank approximations of the Gram matrix.In particular, we will discuss Nyström type approximation (Section 4.1) and Random Fourier Features (RFF) type approximation (Section 4.2).Both types of approximation act directly on the primal representation of the kernel hence provide finite representations of the feature maps.
Recall that the definition of HSIC of X and Y in (4) can also be written in terms of the cross covariance operator C [15,17].Given a data set z = {(x i , y i )} m i=1 with x i ∈ R Dx and y i ∈ R Dy for all i, consider the empirical version of Ξ k X ,k Y (X, Y ) with kernels k X and k Y for X and Y respectively: where the Hilbert Schmidt norm is taken in the product space H k X ⊗ H k Y .Note, this empirical cross covariance operator is infinite dimensional.However, when approximate feature representations are used, the cross covariance operator is a finite dimensional matrix and hence the Hilbert-Schmidt norm is equivalent to the Frobenius norm (F ).
If we let φ( , the above expression can be further simplified as Hence, we obtain the expression in (7).If instead, we replace φ(x i ) and ψ(y i ) by the corresponding low-rank approximations φ( r=1 kY (•, y r ), we can obtain the approximated HSIC statistics.The details of which are provided in the following sections.

Nyström HSIC
In this section, we use the traditional Nyström approach to provide an approximation that consider the similarities between the so called inducing variables and the given dataset.We will start with a review of Nyström method and then we will provide the explicit feature map representation of the Nyström HSIC estimator.To finish, we will discuss two null distribution estimation approaches that cost at most linear in the number of samples.
The reduced-rank approximation matrix provided by Nyström method [46] represents each data point by a vector based on its kernel similarity to the inducing variables and the induced kernel matrix.The approximation is achieved by randomly sample n data points (i.e.inducing variables) from the given m samples and compute the approximate kernel matrix K ≈ K as: where each of K r,s can be think of as the r × s block of the full Gram matrix K computed using all given samples.Further, we can write Eq. 27 as: Hence, an explicit feature representation of K is obtained.Note that [37] further relaxed the setting and propose to use inducing points that are not necessarily a subset of the given data but only need to explain the dataset well for a good performance.

The Nyström HSIC Statistic
To further reduce computation cost, we propose to approximate this reducedrank kernel matrix K with the uncentered covariance matrix C that is n × n: Let us denote CX = ΦT X ΦX and CY = ΦT Y ΦY .In order to approximate the biased HSIC estimator (Eq.7) using this explicit feature map representation, the ΦX and ΦY needed to be centred.We suggest centre each column separately by subtracting its mean for both ΦX and ΦY , i.e. denote Φ. = (I m − 1 m 11 T ) Φ. = H Φ. ∈ R m×n. for X and Y respectively.
Using the methods described above, we can substitute approximated kernel functions kX = ΦX and kY = ΦY into the empirical version of where ΦX (x i ) ∈ R nx and ΦY (y i ) ∈ R ny can both be computed in linear time in m.This is the biased Nyström estimator of HSIC.Essentially, we approximate the cross covariance operator C XY by the Nyström estimator CXY = 1 m ΦT X ΦY ∈ R nx×ny which only requires O(n x n y m).In essence, the HSIC statistic computed using Nyström as we described here is a HSIC statistic computed using a different kernel.As a remark, we note that it is not immediately clear how one can choose the inducing points optimally.For the synthetic data experiments in Section 5, we simulate the inducing data from the same distribution as data X.But, we will leave the more general case as further work.

Null Distribution Estimations
Having introduced the biased Nyström HSIC statistics, we will now move on to discuss two null distribution estimation methods, namely the permutation approach and the Nyström spectral approach.The permutation approach is exactly the same as Section 2.4.1 with Ξ k X ,k Y (z) replaced by ΞNy, kX , kY (z).It is worth noting that for each permutation, we need to simulate a new set of inducing points for X and Y such that n x , n y m with m being the number of samples.
Likewise, the Nyström spectral approach is similar to that described in Section 2.4.2where eigen-decompositions of the centred Gram matrices are required to simulate the null distribution.The difference is that we approximate the centred Gram matrices using Nyström method and the HSIC V-statistic is replaced by the Nyström HSIC estimator ΞNy, kX , kY (z).So, the null distribution is then estimated using the eigenvalues from the covariance matrices ΦT X ΦX and ΦT Y ΦY .In such a way, the computational complexity is reduced from the original O(m 3 ) to O(n 3 x + n 3 y + (n 2 x + n 2 y )m + n x n y m) i.e. linear in m.

Random Fourier Feature HSIC
So far, we have looked at two large-scale approximation techniques that are applicable to any positive definite kernel.If the corresponding kernel also happens to be translation-invariant with the moment condition in (37), however, an additional popular large-scale technique can be applied: random Fourier features of [28] which is based on Bochner's representation.In this section, we will first review Bochner's theorem and subsequently build up to how random Fourier features can be used to approximate large kernel matrices.Utilising it in the context of independence testing, we propose the novel RFF HSIC estimator and further consider two null distribution estimation approaches.

Bochner's Theorem
Through the projection of data into lower dimensional randomised feature space, [28] proposed a method of converting the training and evaluation of any kernel machine into the corresponding operations of a linear machine.In particular, they showed using a randomised feature map z : where x, y ∈ R d .More specifically, [28] demonstrate the construction of feature spaces that uniformly approximate shift-invariant kernels with D = O(d −2 log 1 2 ) where is the accuracy of approximation.However, as we will see later certain moment conditions need to be satisfied.
Bochner's Theorem provides the key observation behind such approximation.This classical theorem (Theorem 6.6 in [45]) is useful in several contexts where one deals with translation-invariant kernels k, i.e. k(x, y) = κ(x − y).As well as constructing large-scale approximation to kernel methods [28], it can also be used to determine whether a kernel is characteristic i.e. if the Fourier transform of a kernel is supported everywhere then the kernel is characteristic [40].
Theorem 2 Bochner's Theorem [45] A continuous transition-invariant kernel k on R d is positive definite if and only if k(δ) is the Fourier transform of a non-negative measure.
For a properly scaled transition-invariant kernel k, the theorem guarantees that its Fourier transform Γ (w) is a non-negative measure on R d .Without loss of generality, Γ is a probability distribution.Since we would like to approximate real-valued kernel matrices, let us consider the approximation which uses only real-valued features.κ(x − y) can be written as : provided that Note, (35) follows because kernels are real valued and (36) uses the double angle formula for cosine.The random features can be computed by first sampling {w j } D j=1 i.i.d.
Here, we deal with explicit feature space, and apply linear methods to approximate the Gram matrix through the covariance matrix Z(x) T Z(x) of dimension D×D where Z(x) is the matrix of random features.Essentially, (37) guarantees that the second moment of the Fourier transform of this translational invariant kernel k to be finite and hence ensure the uniform convergence of z(x) T z(y) to κ(x − y) [28].

RFF HSIC Estimator
The derivation of the biased RFF HSIC estimator follows in the same manner as Section 4.1.1.However, with the RFF approximations of the kernel matrices, (22) becomes: where Z x ∈ R m×Dx and Z y ∈ R m×Dy .Hence, when RFF estimators are substituted, the cross covariance operator is simply a D x ×D y matrix.In the same way as the Nyström HSIC estimator, the HSIC statistic computed using RFF is a HSIC statistic computed using a different kernel, i.e. one that is induced by the random features.It is worth noting that the analysis of convergence of such estimator can possibly be done similarly to the analysis by [42] for MMD.However, we will leave this for future work.
To use the RFF HSIC statistic in independence testing, the permutation approach and spectral approach in the previous section can be adopted for null distribution estimation with ΞNy, kX , kY (z) replaced by ΞRF F, kX , kY (z).Just as the case with inducing points, the {w j } D. j=1 should be sampled each time independently for X and Y when the RFF approximations Z x and Z y needed to be computed.As a remark, the number of inducing points and the number of w j s plays a similar role in both methods which controls the trade off between computational complexity and statistical power.In practice, as we will demonstrate in the next section, such number can be much smaller than the size of the dataset without compromising the performance.

Experiments
In this section, we present three synthetic data experiments to study the behaviour of our large scale HSIC tests.The main experiment is on a challenging non-linear low signal-to-noise ratio dependence dataset to assess the numerical performance amongst the large scale HSIC tests.To investigate the performance of these test in a small scale, we further conduct linear and sine dependence experiments to compare with currently established methods for independence testing.Throughout this section, we set the significance level of the hypothesis test to be α = 0.05.Both Type I and Type II errors are calculated based on 100 trials.The 95% confidence intervals are computed based on normality assumption, i.e. μ ± 1.96 μ (1−μ)  100 , where μ is the estimate for the statistical power.

Simple Linear Experiment
We begin with an investigation of the performance of our methods on a toy example with a small number of observations, in order to check the agreements between large scale approximation methods we proposed and the exact methods where they are still feasible.Towards this aim, we consider a simple linear dependence experiment, but where the dependence between the response Y and the input X is in only a single dimension of X.In particular, X ∼ N (0, I d ) and Y = X 1 + Z where d is the dimensionality of data vector X and X 1 indicate the first dimension of X.The noise Z is independent standard Gaussian noise.We would like to compare methods based on HSIC to a method based on Pearson's Correlation which is explicitly aimed at linear dependence and should give strongest performance.However, as the latter cannot be directly applied to multivariate data, we consider a SubCorr statistic: SubCorr = 1 where Corr(Y, X i ) is the Pearson's correlation between Y and the i th dimension of X .In addition, we will also consider SubHSIC statistic: SubHSIC= For these two methods, we will use a permutation approach as their distributions are not immediately clear.Fig. 1 Simple linear experiment for d = 10.Left: comparing HSIC spectral approach with Nyström spectral method and RFF spectral method.Right: HSIC spectral approach with SubHSIC and SubCorr.
In Fig 1, the dimension of X is set to be 10.Both the number of random features in RFF and the number of inducing variables in Nyström are set to 10.We do not use the block-based method as the sample sizes are small.From Fig 1 (right), we see that SubCorr yields the highest power as expected.HSIC and SubHSIC with Gaussian median-heuristic kernels perform similarly though, with all three giving the power of 1 at the sample size of 100.On the other hand, Fig 1 (left) shows that the two large scale methods are still able to detect the dependence at these small sample sizes, even though there is some loss in power in comparison to HSIC and they would require a larger sample size.As we will see, this requirement for a larger sample size will be offset by a much lower computational cost in large-scale examples.

Sine Dependence Experiment
We now consider a more challenging nonlinear dependence experiment to investigate time vs power tradeoffs of large-scale tests.The dataset consists of a sample of size m generated i.i.d.according to: where d is the dimensionality of data vector X, X i indicates the i th dimension of X and Z ∼ N (0, 1).In addition to HSIC and its large scale versions, we will also consider dCor [44,43] -which can be formulated in terms of HSIC using Brownian Kernel with parameter H = 0.5 [33, Appendix A].In addition, we will consider dCor using the Gaussian kernel with median heuristic bandwidth parameter.The number of random Fourier features and the number of inducing variables are both set to 50.For RFF, we use the Gaussian kernel with median heuristic bandwidth parameter, while for Nyström we in addition use the Brownian kernel with H = 0.5 (note that RFF is not applicable to this kernel as it is not translation-invariant).At these still relatively small sample sizes, block-based approach gave poor performance.From Fig. 2, dCor clearly outperforms the other methods with the Brownian kernel Nyström method giving the closest performance in terms of power.Reassuringly, the four methods using Gaussian kernel all give very similar power performance.Fig. 3, however, tells a very different story -the large-scale methods all reach the power of 1 in a test time which is several orders of magnitude smaller, demonstrating the utility of the introduced tests.

Large Scale Experiment
We would now like to more closely compare the performance of the proposed large scale HSIC tests with each other -at sample sizes where standard HSIC / dCor approaches are no longer feasible.We consider a challenging non-linear and low signal-to-noise ratio experiment, where a sample of size m is generated i.i.d.according to: where d is the dimensionality of the data set X and Z ∼ N (0, I d 2 +1 ).Note that Y is independent of each individual dimension of X and that the dependence is non-linear.For d = 50 and 100, we would like to explore the relationship between the test power across a different number of samples m = {10 5 , 2 × 10 5 , 5 × 10 5 , 10 6 , 2 × 10 6 , 5 × 10 6 , 10 7 }.The number of random features, inducing variables and block size are all set to 200 so that their computational cost is comparable.Gaussian RBF kernel with median heuristic is used in all cases.For RFF and Nyström methods, we used the spectral approach to estimate the null distribution.Fig. 4 is a plot of the test power against the number of samples whereas Fig. 5 is a plot of the test power against average testing time.It is clear that for both d = 50 and d = 100, the RFF method gives the best performance in power for a fixed number of samples, followed by the Nyström method and then by the block-based approach.The RFF method is able to achieve zero type II error (i.e.no failure to reject a false null) with 5×10 4 samples for d = 50 and 5×10 5 samples for d = 100, while the Nyström method has a 80% false negative rate at these sample sizes.The power vs time plot in 5 gives a similar picture as Fig. 4 confirming the superiority of the RFF method on this example.100 , where μ is the estimate for the statistical power.

Discussion and Conclusions
We have proposed three novel large scale estimators of HSIC, a kernel-based nonparametric dependence measure -these are the block-based estimator, the Nyström estimator and the RFF estimator.We subsequently established suitable independence testing procedures for each method -by taking advantage of the normal asymptotic null distribution of the block-based estimator and by employing an approach that directly estimates the eigenvalues appearing in the asymptotic null distribution for the Nyström and RFF methods.All three tests significantly reduce computational complexity in memory and time over the standard HSIC-based test.We verified the validity of our largescale testing methods and its favourable tradeoffs between testing power and computational complexity on challenging high-dimensional synthetic data.We have observed that RFF and Nyström approaches have considerable advantages over the block-based test.Several further extensions can be studied: the developed large-scale approximations are readily applicable to three-variable interaction testing [32], conditional independence testing [13] as well as application in causal discovery [49,12].Moreover, the RFF HSIC approach can be extended using the additional smoothing of characteristic function representations similarly to the approach of [9] in the context of two-sample testing.
where kP X (x, x ) and kP Y (y, y ) are defined in (10).
In fact, we can prove that the expectation of ( 42) is four times the HSIC of X and Y .To see this, we first need to show that such expectation is well defined.Indeed, note that for a valid semimetric Hence, replacing the z i in g 1 with X i and the z i in g 2 with Y i , |h((X 1 , Y 1 ), (X 2 , Y 2 ), ..., (X 6 , Y 6 ))| ≤ g 1 (X 1 , X 2 , X 3 , X 4 )g 2 (Y 1 , Y 2 , Y 5 , Y 6 ).
Since the marginal distributions have finite first moments with respect to the kernels, then each of the terms in g 1 and g 2 is integrable and hence g 1 and g 2 are integrable.Moreover, since the marginal distributions have finite second moments with respect to the kernels, then the joint distribution satisfies P XY ∈ M 1 k X ⊗k Y (X × Y).Therefore, h is integrable.Subsequently, by taking the expectation and utilising Fubini's theorem, we obtain that E(h((X 1 , Y 1 ), (X 2 , Y 2 ), ..., (X 6 , Y 6 ))) = 4E( kP X (X, X ) kP Y (Y, Y )) (46) which is 4 times the HSIC.
In order to use the theory of degenerate V-statistics to obtain the asymptotic distribution, we need to consider the symmetrised version of h, which we define as follows h((X 1 , Y 1 ), ..., (X 6 , Y 6 )) := 1 6! σ∈A h((X σ (1) , Y σ(1) ), ..., (X σ (6) , Y σ (6) )) where A denotes the set of all permutations of {1, ..., 6}.Then, under the null hypothesis of independence P XY = P X × P Y , h2 ((x, y), (x , y )) : = E[ h((x, y), (x , y ), (X 3 , Y 3 ), ..., (X 6 , Y 6 ))] = 4 15 kµ(x, x ) kν (y, y ) If we fix the first two positions to be {1, 2} and randomly permute the rest, we obtain 24 different combinations.Similarly if we fix the first two positions to be {2, 1}, we also obtain 24 different combinations.Some algebraic manipulation shows that these are the combinations that gives the expectation of h to be kµ(x, x ) kν (y, y ).In fact, these are the only combinations as when either {1} or {2} or both are not in the first two positions, the expectation of h is zero and all terms cancel out.
Hence, by Theorem B in Chapter 6 of [35], which says Note, since under the null hypothesis P XY = P X × P Y , the above operator is given by the tensor product of S kP X and S kP Y (Remark 2.9 [24]).Therefore {γr} ∞ r=1 are the products of the eigenvalues of these two operators.By noting that 1 m 2 i,j kP X (x i , x j ) kP Y (y i , y j ) is exactly Ξ b,k X ,k Y (Z), i.e. the V-statistics with the kernel h2 ((x, y), (x , y )) We obtained the desired asymptotic distribution.

Fig. 3
Fig.3The corresponding average testing time plot for the sine dependence experiment for d = 2.

Fig. 5
Fig. 5 Large Scale Experiment: The average testing time comparison between the three large scale independence testing methods.Dotted line: d = 50; solid line: d = 100.

2 r 1 )
x i , y i ), (x j , y j )) as the sample size m → ∞, we obtain that m kP X (x i , x j ) kP Y (y i , y j ) ∀r and {γr} ∞ r=1 are the eigenvalues of the operatorS k : L 2 θ (X × Y) → L 2θ (X × Y) defined as:S k g(x, y) = X ×YkP X (x, x ) kP Y (y, y )g(x , y )dθ(x , y )