Large-scale kernel methods for independence testing
Abstract
Representations of probability measures in reproducing kernel Hilbert spaces provide a flexible framework for fully nonparametric hypothesis tests of independence, which can capture any type of departure from independence, including nonlinear associations and multivariate interactions. However, these approaches come with an at least quadratic computational cost in the number of observations, which can be prohibitive in many applications. Arguably, it is exactly in such large-scale datasets that capturing any type of dependence is of interest, so striking a favourable trade-off between computational efficiency and test performance for kernel independence tests would have a direct impact on their applicability in practice. In this contribution, we provide an extensive study of the use of large-scale kernel approximations in the context of independence testing, contrasting block-based, Nyström and random Fourier feature approaches. Through a variety of synthetic data experiments, it is demonstrated that our large-scale methods give comparable performance with existing methods while using significantly less computation time and memory.
Keywords
Independence testing, Large-scale kernel method, Hilbert–Schmidt independence criteria, Random Fourier features, Nyström method
1 Introduction
Given a paired sample \({\mathbf {z}} = \{ (x_i,y_i) \}^m_{i=1}\) with each \((x_i, y_i)\in {\mathcal {X}} \times {\mathcal {Y}}\) independently and identically following the joint distribution \(P_{XY}\) on some generic domains \({\mathcal {X}}\) and \({\mathcal {Y}}\), the nonparametric independence problem consists of testing whether we should reject the null hypothesis \({\mathcal {H}}_0: P_{XY} = P_X P_Y\) in favour of the general alternative hypothesis \({\mathcal {H}}_1: P_{XY} \ne P_X P_Y\), where \(P_X\) and \(P_Y\) are the marginal distributions for X and Y, respectively. This problem is fundamental and extensively studied, with wide-ranging applications in statistical inference and modelling. Classical dependence measures, such as Pearson’s product–moment correlation coefficient, Spearman’s \(\rho \), Kendall’s \(\tau \) or methods based on contingency tables, are typically designed to capture only particular forms of dependence (e.g. linear or monotone). Furthermore, they are applicable only to scalar random variables or require space partitioning, limiting their use to relatively low dimensions. As the availability of larger datasets also facilitates building more complex models, dependence measures are sought that capture more complex dependence patterns and those that occur between multivariate and possibly high-dimensional datasets. In this light, amongst the most popular dependence measures recently have been those based on characteristic functions (Székely et al. 2007; Székely and Rizzo 2009) as well as a broader framework based on kernel methods (Gretton et al. 2005, 2008). A desirable property of consistency against any alternative, i.e. test power provably increasing to one with the sample size regardless of the form of dependence, is warranted for statistical tests based on such approaches.
However, this is achieved at the expense of computational and memory requirements that increase at least quadratically with the sample size, which is prohibitive for many modern applications. Thus, a natural question is whether a favourable trade-off between computational efficiency and test power can be sought with appropriate large-scale approximations. As we demonstrate, several large-scale approximations are available in this context and they lead to strong improvements in power-per-computational-unit performance, resulting in a fast and flexible independence testing framework responsive to all forms of dependence and applicable to large datasets.
The key quantity we consider is the Hilbert–Schmidt independence criterion (HSIC) introduced by Gretton et al. (2005). HSIC uses the distance between the kernel embeddings of probability measures in the reproducing kernel Hilbert space (RKHS) (Gretton et al. 2008; Zhang et al. 2011; Smola et al. 2007). By building on decades of research into kernel methods for machine learning (Schölkopf and Smola 2002), HSIC can be applied to multivariate observations as well as to those lying in non-Euclidean and structured domains, e.g. Gretton et al. (2008) consider independence testing on text data. HSIC has also been applied to clustering and learning taxonomies (Song et al. 2007; Blaschko and Gretton 2009), feature selection (Song et al. 2012), causal inference (Peters et al. 2014; Flaxman et al. 2015; Zaremba and Aste 2014) and computational linguistics (Nguyen and Eisenstein 2016). A closely related dependence coefficient that measures all types of dependence between two random vectors of arbitrary dimensions is the distance covariance (dCov) of Székely et al. (2007), Székely and Rizzo (2009), which measures distances between empirical characteristic functions or equivalently measures covariances with respect to a stochastic process (Székely and Rizzo 2009), and its normalised counterpart, distance correlation (dCor). RKHS-based dependence measures like HSIC are in fact extensions of dCov: Sejdinovic et al. (2013b) show that dCov can be understood as a form of HSIC with a particular choice of kernel. Moreover, dCor can be viewed as an instance of kernel matrix alignment of Cortes et al. (2012). As we will see, statistical tests based on estimation of HSIC and dCov are computationally expensive and require at least \({\mathcal {O}}(m^2)\) time and storage complexity, where m is the number of observations, just to compute an HSIC estimator which serves as a test statistic.
In addition, the complicated form of the asymptotic null distribution of the test statistics necessitates either permutation testing (Arcones and Gine 1992) (further increasing the computational cost) or even more costly direct sampling from the null distribution, requiring eigendecompositions of kernel matrices using the spectral test of Gretton et al. (2009), with a cost of \({\mathcal {O}}(m^3)\). These memory and time requirements often make the HSIC-based tests infeasible for practitioners.
In this paper, we consider several ways to speed up the computation in HSIC-based tests. More specifically, we introduce three fast estimators of HSIC: the block-based estimator, the Nyström estimator and the random Fourier feature (RFF) estimator, and study the resulting independence tests. In the block-based setting, we obtain a simpler asymptotic null distribution as a consequence of the central limit theorem, in which only the asymptotic variance needs to be estimated; we discuss possible approaches for this. RFF and Nyström estimators correspond to primal finite-dimensional approximations of the kernel functions and as such also warrant estimation of the null distribution in linear time. We introduce novel spectral tests based on eigendecompositions of primal covariance matrices, which avoid the permutation approach and significantly reduce the computational expense of direct sampling from the null distribution.
1.1 Related work
Some of the approximation methods considered in this paper were inspired by their use in the related context of two-sample testing. In particular, the block-based approach for two-sample testing was studied in Gretton et al. (2012b, 2012a), Zaremba et al. (2013) under the name of linear time MMD (maximum mean discrepancy), i.e. the distance between the mean embeddings of the probability distributions in the RKHS. The approach estimates MMD on a small block of data and then averages the estimates over blocks to obtain the final test statistic. Our block-based estimator of HSIC follows exactly the same strategy. On the other hand, the Nyström method (Williams and Seeger 2001; Snelson and Ghahramani 2006) is a classical low-rank kernel approximation technique, where data are projected into lower-dimensional subspaces of the RKHS (spanned by so-called inducing variables). Such an idea is popular in fitting sparse approximations to Gaussian process (GP) regression models, allowing a reduction in the computational cost from \({\mathcal {O}}(m^3)\) to \({\mathcal {O}}(n^2m)\), where \(n \ll m\) is the number of inducing variables. To the best of our knowledge, the Nyström approximation has not been studied in the context of hypothesis testing. Random Fourier feature (RFF) approximations (Rahimi and Recht 2007), however, due to their relationship with evaluations of empirical characteristic functions, do have a rich history in the context of statistical testing, as discussed in Chwialkowski et al. (2015), which also proposes an approach to scale up kernel-based two-sample tests by additional smoothing of characteristic functions, thereby improving the test power and its theoretical properties. Moreover, the approximation of MMD and two-sample testing through primal representation using RFF has also been studied in Zhao and Meng (2015), Sutherland and Schneider (2015), Lopez-Paz (2016). In addition, Lopez-Paz et al. (2013) first proposed the idea of applying RFF in order to construct an approximation to a kernel-based dependence measure. More specifically, they develop randomised canonical correlation analysis (RCCA) (see also Lopez-Paz 2014, 2016), approximating the nonlinear kernel-based generalisation of canonical correlation analysis (Lai and Fyfe 2000; Bach and Jordan 2002), and, using a further copula transformation, construct a test statistic termed RDC (randomised dependence coefficient) requiring \(O(m\log m)\) time to compute. Under suitable assumptions, Bartlett’s approximation (Mardia et al. 1979) provides a closed-form expression for the asymptotic null distribution of this statistic, which further results in a distribution-free test, leading to an attractive option for large-scale independence testing. We extend these ideas based on RFF to construct approximations of HSIC and dCov/dCor, which are conceptually distinct kernel-based dependence measures from that of kernel CCA, i.e. they measure different types of norms of RKHS operators (operator norm vs Hilbert–Schmidt norm).
In fact, the Nyström and RFF approximations can also be viewed through the lens of the nonlinear canonical analysis framework introduced by Dauxois and Nkiet (1998). This is the earliest example we know where nonlinear dependence measures based on spectra of appropriate Hilbert space operators are studied. In particular, the cross-correlation operator with respect to a dictionary of basis functions in \(L_2\) (e.g. B-splines) is considered in Dauxois and Nkiet (1998). Huang et al. (2009) link this framework to the RKHS perspective. The functions of the spectra that were considered in Dauxois and Nkiet (1998) are very general, but the simplest one (sum of the squared singular values) can be recast as the normalised cross-covariance operator (NOCCO) of Fukumizu et al. (2008), which considers the Hilbert–Schmidt norm of the cross-correlation operator on RKHSs and as such extends kernel CCA to consider the entire spectrum. While in this work we focus on HSIC (Hilbert–Schmidt norm of the cross-covariance operator), which is arguably the most popular kernel dependence measure in the literature, a similar Nyström or RFF approximation can be applied to NOCCO as well, and we leave this as a topic for future work.
The paper is structured as follows: in Sect. 2, we first provide some necessary definitions from the RKHS theory, review the aforementioned Hilbert–Schmidt independence criterion (HSIC) and discuss its biased and unbiased quadratic time estimators. Then, Sect. 2.3 gives the asymptotic null distributions of the estimators (proofs provided in Section A). In Sect. 3, we develop a block-based HSIC estimator and derive its asymptotic null distribution. Following this, a linear time asymptotic variance estimation approach is proposed. In Sects. 4.1 and 4.2, we propose the Nyström HSIC and RFF HSIC estimators, respectively, both with the corresponding linear time null distribution estimation approaches. Finally, in Sect. 5, we explore the performance of the three testing approaches on a variety of challenging synthetic data.
2 Background
This section starts with a brief overview of the key concepts and notation required to understand the RKHS theory and kernel embeddings of probability distributions into the RKHS. It then provides the definition of HSIC which will serve as a basis for later independence tests. We review the quadratic time biased and unbiased estimators of HSIC as well as their respective asymptotic null distributions. As the final part of this section, we outline the construction of independence tests in quadratic time.
2.1 RKHS and embeddings of measures
Let \({\mathcal {Z}}\) be any topological space on which Borel measures can be defined. By \({\mathcal {M}}({\mathcal {Z}})\) we denote the set of all finite signed Borel measures on \({\mathcal {Z}}\) and by \({\mathcal {M}}^1_+ ({\mathcal {Z}})\) the set of all Borel probability measures on \({\mathcal {Z}}\). We will now review the basic concepts of RKHS and kernel embeddings of probability measures. For further details, see Berlinet and Thomas-Agnan (2004), Steinwart and Christmann (2008), Sriperumbudur (2010).
Definition 1
 1. \(\forall z \in {\mathcal {Z}}, k(\cdot ,z) \in {\mathcal {H}}\)
 2. \(\forall z \in {\mathcal {Z}}, \forall f \in {\mathcal {H}}, \langle f,k(\cdot ,z) \rangle _{\mathcal {H}} = f(z).\)

Linear kernel: \(k(x,y) = x^T y\);

Polynomial kernel of degree \(d \in {\mathbb {N}}\): \(k(x,y) = (x^T y + 1)^d \);

Gaussian kernel with bandwidth \(\sigma > 0\): \(k(x,y) = \exp \left( -\frac{\Vert x - y\Vert ^2}{2\sigma ^2}\right) \);

Fractional Brownian motion covariance kernel with parameter \(h\in (0,1)\): \(k(x,y)=\frac{1}{2}\left( \Vert x\Vert ^{2h}+\Vert y\Vert ^{2h}-\Vert x-y\Vert ^{2h}\right) \).
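The four example kernels can be written down directly. The following is a minimal NumPy sketch rather than code from the paper; the function names and default parameter values are illustrative assumptions.

```python
import numpy as np

# Illustrative helper names; these are not taken from the paper.

def linear_kernel(x, y):
    """k(x, y) = x^T y."""
    return float(np.dot(x, y))

def polynomial_kernel(x, y, d=2):
    """k(x, y) = (x^T y + 1)^d for a positive integer degree d."""
    return float((np.dot(x, y) + 1.0) ** d)

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def fbm_kernel(x, y, h=0.5):
    """k(x, y) = 0.5 * (||x||^{2h} + ||y||^{2h} - ||x - y||^{2h}), 0 < h < 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny, nxy = np.linalg.norm(x), np.linalg.norm(y), np.linalg.norm(x - y)
    return float(0.5 * (nx ** (2 * h) + ny ** (2 * h) - nxy ** (2 * h)))
```

For scalar non-negative inputs and \(h=0.5\), the last kernel reduces to \(\min (x,y)\), the covariance of standard Brownian motion, which gives a quick sanity check on the restored minus sign.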
Definition 2
It is understood from this definition that the integral of any RKHS function f under the measure \(\nu \) can be evaluated as the inner product between f and the kernel embedding \(\mu _k (\nu )\) in the RKHS \( {\mathcal {H}}_k.\) As an alternative, the kernel embedding can be defined through the use of Bochner integral \(\mu _k (\nu ) = \int k(\cdot ,z) \mathrm{d}\nu (z)\). Any probability measure is mapped to the corresponding expectation of the canonical feature map. By Cauchy–Schwarz inequality and the Riesz representation theorem, a sufficient condition for the existence of an embedding of \(\nu \) is that \(\nu \in {\mathcal {M}}^{1/2}_{k}({\mathcal {Z}})\), where we adopt notation from Sejdinovic et al. (2013b): \({\mathcal {M}}^\theta _{k}({\mathcal {Z}}) = \left\{ \nu \in {\mathcal {M}}(Z): \int k^\theta (z,z) \mathrm{d}\nu (z) < \infty \right\} \), which is, e.g. satisfied for all finite measures if k is a bounded function (such as Gaussian kernel).
Embeddings allow measuring distances between probability measures, giving rise to the notion of maximum mean discrepancy (MMD) (Borgwardt et al. 2006; Gretton et al. 2012b).
Definition 3
Definition 4
HSIC is well defined whenever \(P_X \in {\mathcal {M}}^1_{k_{{\mathcal {X}}}}({\mathcal {X}})\) and \(P_Y \in {\mathcal {M}}^1_{k_{{\mathcal {Y}}}}({\mathcal {Y}})\) as this implies \(P_{XY} \in {\mathcal {M}}^{1/2}_{k_{{\mathcal {X}}}\otimes k_{{\mathcal {Y}}}}({\mathcal {X}} \times {\mathcal {Y}})\) (Sejdinovic et al. 2013b). The name of HSIC comes from the operator view of the RKHS \({\mathcal {H}}_{k_{\mathcal {X}}\otimes k_{\mathcal {Y}}}\). Namely, the difference between embeddings \({\mathbb {E}}_{XY}[k_{\mathcal {X}}(.,X) \otimes k_{\mathcal {Y}}(.,Y)] - {\mathbb {E}}_X k_{\mathcal {X}}(.,X) \otimes {\mathbb {E}}_Y k_{\mathcal {Y}}(.,Y)\) can be identified with the cross-covariance operator \(C_{XY}:{\mathcal {H}}_{k_{\mathcal {Y}}}\rightarrow {\mathcal {H}}_{k_{\mathcal {X}}}\) for which \(\langle f,C_{XY}g\rangle _{{\mathcal {H}}_{k_{\mathcal {X}}}}={\text {Cov}}\left[ f(X), g(Y)\right] \), \(\forall f\in {\mathcal {H}}_{k_{\mathcal {X}}},g\in {\mathcal {H}}_{k_{\mathcal {Y}}}\) (Gretton et al. 2005, 2008). HSIC is then simply the squared Hilbert–Schmidt norm \(\Vert C_{XY}\Vert _{HS}^2\) of this operator, while distance correlation (dCor) of Székely et al. (2007), Székely and Rizzo (2009) can be cast as \(\Vert C_{XY}\Vert _{HS}^2 / \left( \Vert C_{XX}\Vert _{HS}\Vert C_{YY}\Vert _{HS}\right) \) (Sejdinovic et al. 2013b, Appendix A). In the sequel, we will suppress dependence on kernels \(k_{\mathcal {X}}\) and \(k_{\mathcal {Y}}\) in the notation \(\varXi _{k_{\mathcal {X}},k_{\mathcal {Y}}}(X,Y)\) where there is no ambiguity.
Repeated application of the reproducing property gives the following equivalent representation of HSIC (Smola et al. 2007):
Proposition 1
2.2 Estimation of HSIC
2.3 Asymptotic null distribution of estimators
Theorem 1 below gives the asymptotic null distribution of the biased HSIC statistic defined in (9). This asymptotic distribution provides the theoretical foundation for the spectral testing approach (described in Sect. 2.4.2) that we will use throughout the paper.
Theorem 1
We note that Chwialkowski and Gretton (2014) (Lemma 2 and Theorem 1) prove a more general result, applicable to dependent observations under certain mixing conditions, of which the i.i.d. setting is a special case. Moreover, Rubenstein et al. (2015) (Theorems 5 and 6) provide another elegant proof in the context of three-variable interaction testing from Sejdinovic et al. (2013a). However, both Chwialkowski and Gretton (2014) and Rubenstein et al. (2015) assume boundedness of \(k_{\mathcal {X}}\) and \(k_{\mathcal {Y}}\), while our proof in the Appendix assumes a weaker condition of finite second moments for both \(k_{\mathcal {X}}\) and \(k_{\mathcal {Y}}\), thus making the result applicable to unbounded kernels such as the Brownian motion covariance kernel.
2.4 Quadratic time null distribution estimations
We would like to design independence tests with an asymptotic Type I error of \(\alpha \) and hence we need an estimate of the \((1-\alpha )\) quantile of the null distribution. Here, we consider two frequently used approaches, namely the permutation approach and the spectral approach, that require at least quadratic time both in terms of memory and computation time. The biased V-statistic will be used because of its neat and compact formulation.
2.4.1 Permutation approach
Consider an i.i.d. sample \({\mathbf {z}} = \left\{ (x_i,y_i)\right\} ^m_{i=1}\) with chosen kernels \(k_{{\mathcal {X}}}\) and \(k_{{\mathcal {Y}}}\), respectively. The permutation/bootstrap approach (Arcones and Gine 1992) proceeds in the following manner. With the total number of shuffles fixed at \(N_p\), we first compute \(\varXi _{k_{\mathcal {X}},k_{\mathcal {Y}}}({\mathbf {z}})\) using \({\mathbf {z}}\), \(k_{{\mathcal {X}}}\) and \(k_{{\mathcal {Y}}}\). Then, for each shuffle, we fix the \(\{ x_i \}^m_{i=1}\) and randomly permute the \(\{ y_i \}^m_{i=1}\) to obtain \({\mathbf {z}}^* = \left\{ (x_i,y^*_i)\right\} ^m_{i=1}\) and subsequently compute \(\varXi ^*_{k_{\mathcal {X}},k_{\mathcal {Y}}}({\mathbf {z^*}})\). The one-sided p value in this instance is the proportion of HSIC values computed on the permuted data that are greater than or equal to \(\varXi _{k_{\mathcal {X}},k_{\mathcal {Y}}}({\mathbf {z}})\).
The computational time is \({\mathcal {O}}\)(number of shuffles \(\times m^2)\) for this approach, where the number of shuffles determines the extent to which we have explored the sampling distribution. In other words, with a small number of shuffles we may only obtain realisations from around the mode of the distribution, so that the tail structure is not adequately captured. Although a larger number of shuffles ensures proper exploration of the sampling distribution, the computational cost can be high.
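The permutation procedure can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: it assumes Gaussian kernels with user-supplied bandwidths and assumes that the biased V-statistic takes the standard form \(\frac{1}{m^2}\mathrm {Tr}(\widetilde{K}_X K_Y)\) with \(\widetilde{K}_X = HK_XH\). The function names are hypothetical.

```python
import numpy as np

def gaussian_gram(x, sigma):
    """Gram matrix of the Gaussian kernel on the rows of x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_permutation_test(x, y, sigma_x=1.0, sigma_y=1.0, n_shuffles=200, seed=0):
    """Biased HSIC V-statistic and its permutation p value (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    K = gaussian_gram(x, sigma_x)
    L = gaussian_gram(y, sigma_y)
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kc = H @ K @ H  # centring one Gram matrix is enough for the trace
    stat = np.sum(Kc * L) / m ** 2  # Tr(Kc L) / m^2, since L is symmetric
    null = np.empty(n_shuffles)
    for s in range(n_shuffles):
        p = rng.permutation(m)
        # Permuting the y sample permutes rows and columns of L together.
        null[s] = np.sum(Kc * L[np.ix_(p, p)]) / m ** 2
    # Proportion of permuted statistics at least as large as the observed one.
    return stat, float(np.mean(null >= stat))
```

Each shuffle costs \({\mathcal {O}}(m^2)\) here because the Gram matrices are reused and only re-indexed, which matches the overall cost stated above.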
2.4.2 Spectral approach
The spectral approach (Gretton et al. 2009; Zhang et al. 2011) requires that we first calculate the centred Gram matrices \(\widetilde{K}_X = HK_{X}H\) and \(\widetilde{K}_Y = HK_{Y}H\) for the chosen kernels \(k_{{\mathcal {X}}}\) and \(k_{{\mathcal {Y}}}\). Then, we compute the \(m \varXi _{b,k_{\mathcal {X}},k_{\mathcal {Y}}}({\mathbf {z}})\) statistic according to (9). Next, the spectra (i.e. eigenvalues) \(\{\lambda _i\}^m_{i=1}\) and \(\{\eta _i\}^m_{i=1}\) of \(\widetilde{K}_X\) and \(\widetilde{K}_Y\), respectively, are calculated. The empirical null distribution can be simulated by drawing a sufficiently large number of i.i.d. samples from the standard normal distribution (Zhang et al. 2011) and then generating the test statistic according to (10). Finally, the p value is computed as the proportion of simulated samples that are greater than or equal to the observed \(m \varXi _{b,k_{\mathcal {X}},k_{\mathcal {Y}}}({\mathbf {z}})\) value.
Additionally, Zhang et al. (2011) have provided an approximation to the null distribution with a two-parameter Gamma distribution. Despite the computational advantage of such an approach, the permutation and spectral approaches are still preferred since there is no consistency guarantee for the Gamma distribution approach.
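The spectral approach described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: Gaussian kernels, and a simulated null of the form \(\frac{1}{m^2}\sum _{i,j}\lambda _i\eta _j N_{ij}^2\) with \(N_{ij}\) i.i.d. standard normal and \(\lambda _i,\eta _j\) the eigenvalues of the centred Gram matrices, which is the usual empirical counterpart of the weighted chi-square limit. Function names are hypothetical.

```python
import numpy as np

def centred_gaussian_gram(x, sigma):
    """H K H for the Gaussian kernel, with H the centring matrix."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def hsic_spectral_test(x, y, sigma_x=1.0, sigma_y=1.0, n_samples=300, seed=0):
    """m * biased HSIC and a p value from the simulated spectral null."""
    rng = np.random.default_rng(seed)
    Kc = centred_gaussian_gram(x, sigma_x)
    Lc = centred_gaussian_gram(y, sigma_y)
    m = Kc.shape[0]
    stat = np.sum(Kc * Lc) / m  # m * Tr(Kc Lc) / m^2
    # Eigenvalues scaled by 1/m; tiny negative values are numerical noise.
    lam = np.clip(np.linalg.eigvalsh(Kc), 0.0, None) / m
    eta = np.clip(np.linalg.eigvalsh(Lc), 0.0, None) / m
    weights = np.outer(lam, eta).ravel()
    # N_ij^2 with N_ij standard normal is chi-square with one degree of freedom.
    null = np.array([weights @ rng.chisquare(1, size=weights.size)
                     for _ in range(n_samples)])
    return stat, float(np.mean(null >= stat)), null
```

The two eigendecompositions are the \({\mathcal {O}}(m^3)\) step; once the eigenvalues are available, each simulated null sample only requires drawing \(m^2\) chi-square variables.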
3 Block-based HSIC
The quadratic time test statistics are prohibitive for large datasets, as they require \({\mathcal {O}}(m^2)\) storage and computation time. Furthermore, one requires an approximation of the asymptotic null distribution in order to compute the p value. As we discussed in the previous section, this is usually done by randomly permuting the Y observations (i.e. the permutation approach) or by performing an eigendecomposition of the centred kernel matrices for X and Y (i.e. the spectral approach). Both approaches are expensive in terms of memory and can be computationally infeasible. In this section, we propose a block-based estimator of HSIC which reduces the computational time to linear in the number of samples. The asymptotic null distribution of this estimator will be shown to have a simple form as a result of the central limit theorem (CLT).
3.1 The block HSIC statistic
3.2 Null distribution of block-based HSIC
3.3 Linear time null distribution estimation
Expression (18) guarantees the Gaussianity of the null distribution of the block-based statistic and hence makes the computation of the p value straightforward. We simply return the test statistic \(\sqrt{mB}\frac{ \hat{\eta }_{b}}{\sqrt{\hat{\sigma }^2_{k,0}}}\) and compare it against the corresponding quantile of \({\mathcal {N}}(0,1 )\), which is the approach taken in Gretton et al. (2012a), Zaremba et al. (2013), Sejdinovic et al. (2014). Note that the resulting null distribution is actually a t-distribution but with a very large number of degrees of freedom, which can be treated as a Gaussian distribution.
The difficulty of estimating the null distribution lies in estimating \(\sigma ^2_{k,0}\). We suggest two ways to estimate this variance (Sejdinovic et al. 2014): within-block permutation and within-block direct estimation. These two approaches are at most quadratic in B within each block, which means that the computational cost of estimating the variance is of the same order as that of computing the statistic itself.
Within-block permutation can be done as follows. Within each block, we compute the test statistic using (16). At the same time, we track in parallel a sequence \(\hat{\eta }^*_{b}\) obtained using the same formula but with \(\{y_i\}^m_{i=1}\) having undergone a permutation. The former is used to calculate the overall block statistic and the latter is used to estimate the null variance \(\hat{\sigma }^2_{k,0} = B^2 {\mathbb {V}}ar [ \{\hat{\eta }^*_{b} \}^{m/B}_{b=1} ]\), as the independence between the samples holds by construction.
Regarding the choice of B, Zaremba et al. (2013) discussed that the null distribution is close to that guaranteed by the CLT when B is small, and hence, the Type I error will be closer to the desired level. However, the disadvantage is the small statistical power for each given sample size. Conversely, Zaremba et al. (2013) pointed out that a larger B results in a lower variance empirical null distribution and hence higher power. Hence, they suggested that a sensible family of heuristics is to set \(B=\lfloor m^{\gamma }\rfloor \) for some \( 0<\gamma <1\). As a result, the complexity of the block-based test is \({\mathcal {O}}(Bm) = {\mathcal {O}}(m^{1+\gamma })\).
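The block-based test with within-block permutation can be sketched as below. This is an illustrative NumPy sketch, not the paper's code. It assumes that the per-block statistic is the unbiased quadratic-time HSIC estimator in the form given by Song et al. (2012), uses Gaussian kernels, and requires \(B \ge 4\). All function names are hypothetical.

```python
import math
import numpy as np

def gaussian_gram(x, sigma):
    """Gram matrix of the Gaussian kernel on the rows of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def unbiased_hsic(K, L):
    """Unbiased HSIC estimate from two Gram matrices of size n >= 4."""
    n = K.shape[0]
    Kt = K - np.diag(np.diag(K))  # zero out the diagonals
    Lt = L - np.diag(np.diag(L))
    t1 = np.sum(Kt * Lt)  # Tr(Kt Lt)
    t2 = Kt.sum() * Lt.sum() / ((n - 1) * (n - 2))
    t3 = 2.0 * np.sum(Kt.sum(axis=1) * Lt.sum(axis=1)) / (n - 2)  # 1^T Kt Lt 1
    return (t1 + t2 - t3) / (n * (n - 3))

def block_hsic_test(x, y, B, sigma_x=1.0, sigma_y=1.0, seed=0):
    """Normalised block statistic and its one-sided N(0, 1) p value."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n_blocks = x.shape[0] // B  # leftover samples are dropped
    eta = np.empty(n_blocks)
    eta_perm = np.empty(n_blocks)
    for b in range(n_blocks):
        idx = slice(b * B, (b + 1) * B)
        K = gaussian_gram(x[idx], sigma_x)
        L = gaussian_gram(y[idx], sigma_y)
        eta[b] = unbiased_hsic(K, L)
        p = rng.permutation(B)  # within-block permutation of y
        eta_perm[b] = unbiased_hsic(K, L[np.ix_(p, p)])
    sigma2 = B ** 2 * np.var(eta_perm, ddof=1)  # null variance estimate
    stat = math.sqrt(n_blocks * B * B) * eta.mean() / math.sqrt(sigma2)
    pval = 0.5 * math.erfc(stat / math.sqrt(2.0))  # upper tail of N(0, 1)
    return stat, pval
```

With \(B=\lfloor m^{\gamma }\rfloor \), for instance `B = int(m ** 0.5)`, the loop touches \(m/B\) Gram matrices of size \(B\times B\), which gives the \({\mathcal {O}}(Bm)\) cost stated above.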
4 Approximate HSIC through primal representations
Having discussed how we can construct a linear time HSIC test by processing the dataset in blocks, we now move on to consider how the scaling up can be done through low-rank approximations of the Gram matrix. In particular, we will discuss Nyström type approximation (Sect. 4.1) and random Fourier features (RFF) type approximation (Sect. 4.2). Both types of approximation act directly on the primal representation of the kernel and hence provide finite representations of the feature maps.
4.1 Nyström HSIC
In this section, we use the traditional Nyström approach to provide an approximation that considers the similarities between the so-called inducing variables and the given dataset. We will start with a review of the Nyström method and then provide the explicit feature map representation of the Nyström HSIC estimator. To finish, we will discuss two null distribution estimation approaches whose cost is at most linear in the number of samples.
4.1.1 The Nyström HSIC statistic
4.1.2 Null distribution estimations
Having introduced the biased Nyström HSIC statistic, we will now move on to discuss two null distribution estimation methods, namely the permutation approach and the Nyström spectral approach. The permutation approach is exactly the same as in Sect. 2.4.1 with \(\varXi _{k_{\mathcal {X}},k_{\mathcal {Y}}}({\mathbf {z}})\) replaced by \(\hat{\varXi }_{Ny, \tilde{k}_{\mathcal {X}},\tilde{k}_{\mathcal {Y}}}({\mathbf {z}})\). It is worth noting that for each permutation, we need to simulate a new set of inducing points for X and Y such that \(n_x, n_y \ll m\), with m being the number of samples.
Likewise, the Nyström spectral approach is similar to that described in Sect. 2.4.2, where eigendecompositions of the centred Gram matrices are required to simulate the null distribution. The difference is that we approximate the centred Gram matrices using the Nyström method and the HSIC V-statistic is replaced by the Nyström HSIC estimator \(\hat{\varXi }_{Ny, \tilde{k}_{\mathcal {X}},\tilde{k}_{\mathcal {Y}}}({\mathbf {z}})\). The null distribution is then estimated using the eigenvalues of the covariance matrices \(\tilde{\varPhi }_X^T \tilde{\varPhi }_X \) and \( \tilde{\varPhi }_Y^T \tilde{\varPhi }_Y \). In this way, the computational complexity is reduced from the original \({\mathcal {O}}(m^3)\) to \({\mathcal {O}}(n_x^3+n_y^3+(n_x^2+n_y^2)m + n_xn_ym)\), i.e. linear in m.
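The Nyström spectral approach can be sketched as follows. This is an illustrative NumPy sketch under several assumptions that are not taken from the text: inducing points are drawn uniformly from the data, the feature map is \(\varPhi = K_{mn}K_{nn}^{-1/2}\) centred column-wise, the statistic is \(m\Vert \frac{1}{m}\tilde{\varPhi }_X^T\tilde{\varPhi }_Y\Vert _F^2\), and the null weights are products of eigenvalues of \(\frac{1}{m}\tilde{\varPhi }^T\tilde{\varPhi }\). The scaling conventions and names are hypothetical.

```python
import numpy as np

def gaussian_cross_gram(a, b, sigma):
    """Gaussian kernel evaluations between the rows of a and the rows of b."""
    a2 = np.sum(a ** 2, axis=1)
    b2 = np.sum(b ** 2, axis=1)
    d2 = np.maximum(a2[:, None] + b2[None, :] - 2.0 * a @ b.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_features(x, n_inducing, sigma, rng, jitter=1e-8):
    """Centred Nystrom feature matrix of shape (m, n_inducing)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    idx = rng.choice(x.shape[0], size=n_inducing, replace=False)
    Knn = gaussian_cross_gram(x[idx], x[idx], sigma)
    Kmn = gaussian_cross_gram(x, x[idx], sigma)
    evals, evecs = np.linalg.eigh(Knn)
    evals = np.maximum(evals, jitter)  # guard against tiny eigenvalues
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T  # Knn^{-1/2}
    Phi = Kmn @ inv_sqrt
    return Phi - Phi.mean(axis=0)  # same as multiplying by H on the left

def nystrom_hsic_test(x, y, n_x=20, n_y=20, sigma_x=1.0, sigma_y=1.0,
                      n_samples=500, seed=0):
    """Nystrom HSIC statistic (times m) and a spectral-null p value."""
    rng = np.random.default_rng(seed)
    Phi_x = nystrom_features(x, n_x, sigma_x, rng)
    Phi_y = nystrom_features(y, n_y, sigma_y, rng)
    m = Phi_x.shape[0]
    C = Phi_x.T @ Phi_y / m  # n_x by n_y cross-covariance of the features
    stat = m * np.sum(C ** 2)
    lam = np.clip(np.linalg.eigvalsh(Phi_x.T @ Phi_x / m), 0.0, None)
    eta = np.clip(np.linalg.eigvalsh(Phi_y.T @ Phi_y / m), 0.0, None)
    weights = np.outer(lam, eta).ravel()
    null = rng.chisquare(1, size=(n_samples, weights.size)) @ weights
    return stat, float(np.mean(null >= stat))
```

Only \(n_x\times n_x\), \(n_y\times n_y\) and \(n_x\times n_y\) matrices are decomposed or formed, so the cost in m comes solely from the matrix products with the \(m\times n\) feature matrices.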
4.2 Random Fourier feature HSIC
So far, we have looked at two large-scale approximation techniques that are applicable to any positive-definite kernel. However, if the corresponding kernel also happens to be translation-invariant and satisfies the moment condition in (39), an additional popular large-scale technique can be applied: random Fourier features of Rahimi and Recht (2007), which are based on Bochner’s representation. Note that many other kernels, though not translation-invariant, e.g. the arc-cosine kernel, are also universal and approximable by random features (Cho and Saul 2009). In this section, we will first review Bochner’s theorem and subsequently build up to how random Fourier features can be used to approximate large kernel matrices. Utilising this in the context of independence testing, we propose the RFF HSIC estimator and further consider two null distribution estimation approaches.
4.2.1 Bochner’s theorem
Bochner’s theorem provides the key observation behind such an approximation. This classical theorem (Theorem 6.6 in Wendland 2005) is useful in several contexts where one deals with translation-invariant kernels k, i.e. \(k(x,y) = \kappa (x-y)\). As well as constructing large-scale approximations to kernel methods (Rahimi and Recht 2007), it can also be used to determine whether a kernel is characteristic, i.e. if the Fourier transform of a kernel is supported everywhere then the kernel is characteristic (Sriperumbudur 2010).
Theorem 2
(Bochner’s theorem, Wendland 2005) A continuous translation-invariant kernel \(k(x,y)=\kappa (x-y)\) on \({\mathbb {R}}^d\) is positive definite if and only if \(\kappa (\delta )\) is the Fourier transform of a non-negative measure.
Here, we deal with an explicit feature space and apply linear methods to approximate the Gram matrix through the covariance matrix \(Z(x)^TZ(x)\) of dimension \(D \times D\), where Z(x) is the matrix of random features. Essentially, (39) guarantees that the second moment of the Fourier transform of the translation-invariant kernel k is finite and hence ensures the uniform convergence of \(z(x)^Tz(y)\) to \(\kappa (x-y)\) (Rahimi and Recht 2007).
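The convergence of \(z(x)^Tz(y)\) to \(\kappa (x-y)\) can be checked numerically. The sketch below is an illustration for the Gaussian kernel only, for which the spectral measure is \({\mathcal {N}}(0,\sigma ^{-2}I)\); it uses the common feature construction \(z_j(x)=\sqrt{2/D}\cos (w_j^Tx+b_j)\) with \(b_j\) uniform on \([0,2\pi ]\), which is one of several equivalent choices and is assumed here rather than taken from the text. Function names are hypothetical.

```python
import numpy as np

def rff_features(x, n_features, sigma, rng):
    """Random Fourier features for the Gaussian kernel with bandwidth sigma."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d = x.shape[1]
    # Frequencies drawn from the spectral measure N(0, sigma^{-2} I).
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ W + b)

def gaussian_gram(x, sigma):
    """Exact Gaussian Gram matrix, used only for comparison."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def max_rff_error(x, n_features, sigma=1.0, seed=0):
    """Largest entrywise gap between Z Z^T and the exact Gram matrix."""
    rng = np.random.default_rng(seed)
    Z = rff_features(x, n_features, sigma, rng)
    return float(np.max(np.abs(Z @ Z.T - gaussian_gram(x, sigma))))
```

Calling `max_rff_error` with increasing `n_features` on a fixed set of points should show the gap shrinking roughly at the rate \(1/\sqrt{D}\), up to random fluctuation.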
4.2.2 RFF HSIC estimator
To use the RFF HSIC statistic in independence testing, the permutation approach and spectral approach in the previous section can be adopted for null distribution estimation with \(\hat{\varXi }_{Ny, \tilde{k}_{\mathcal {X}},\tilde{k}_{\mathcal {Y}}}({\mathbf {z}})\) replaced by \(\hat{\varXi }_{RFF, \tilde{k}_{\mathcal {X}},\tilde{k}_{\mathcal {Y}}}({\mathbf {z}})\). Just as in the case of inducing points, the \(\{w_j\}_{j=1}^{D_.}\) should be sampled each time independently for X and Y when the RFF approximations \(Z_x\) and \(Z_y\) need to be computed. As a remark, the number of inducing points and the number of \(w_j\)s play a similar role in both methods: they control the trade-off between computational complexity and statistical power. In practice, as we will demonstrate in the next section, this number can be much smaller than the size of the dataset without compromising the performance.
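An RFF HSIC test with a spectral null can be sketched along the same lines as the Nyström one. This is an illustrative NumPy sketch, not the paper's implementation: it assumes Gaussian kernels, cosine features with random phases, the statistic \(m\Vert \frac{1}{m}\tilde{Z}_x^T\tilde{Z}_y\Vert _F^2\) on column-centred features, and null weights given by products of eigenvalues of the \(D\times D\) feature covariance matrices. Names and scaling conventions are hypothetical.

```python
import numpy as np

def centred_rff(x, n_features, sigma, rng):
    """Column-centred random Fourier features for the Gaussian kernel."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    W = rng.normal(scale=1.0 / sigma, size=(x.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Z = np.sqrt(2.0 / n_features) * np.cos(x @ W + b)
    return Z - Z.mean(axis=0)

def rff_hsic_test(x, y, D_x=50, D_y=50, sigma_x=1.0, sigma_y=1.0,
                  n_samples=500, seed=0):
    """RFF HSIC statistic (times m) and a spectral-null p value."""
    rng = np.random.default_rng(seed)
    # Frequencies are sampled independently for X and for Y.
    Zx = centred_rff(x, D_x, sigma_x, rng)
    Zy = centred_rff(y, D_y, sigma_y, rng)
    m = Zx.shape[0]
    C = Zx.T @ Zy / m  # D_x by D_y cross-covariance of the features
    stat = m * np.sum(C ** 2)
    lam = np.clip(np.linalg.eigvalsh(Zx.T @ Zx / m), 0.0, None)
    eta = np.clip(np.linalg.eigvalsh(Zy.T @ Zy / m), 0.0, None)
    weights = np.outer(lam, eta).ravel()
    null = rng.chisquare(1, size=(n_samples, weights.size)) @ weights
    return stat, float(np.mean(null >= stat))
```

The numbers of features `D_x` and `D_y` play the role that the numbers of inducing points play in the Nyström sketch: larger values cost more but track the exact statistic more closely.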
5 Experiments
In this section, we present three synthetic data experiments to study the behaviour of our large-scale HSIC tests. The main experiment is on a challenging nonlinear low signal-to-noise ratio dependence dataset to assess the numerical performance amongst the large-scale HSIC tests. To investigate the performance of these tests at a small scale, we further conduct linear and sine dependence experiments to compare with currently established methods for independence testing. Throughout this section, we set the significance level of the hypothesis test to be \(\alpha = 0.05\). Both Type I and Type II errors are calculated based on 100 trials. The 95% confidence intervals are computed based on a normality assumption, i.e. \(\hat{\mu } \pm 1.96 \sqrt{\frac{\hat{\mu } (1- \hat{\mu })}{100}}\), where \(\hat{\mu }\) is the estimate of the statistical power. Unless otherwise stated, the HSIC, RFF and Nyström approaches all use the Gaussian kernel with the median heuristic.
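As a small worked example of the confidence interval formula, the following sketch evaluates it for a given power estimate. The function name is hypothetical; the default of 100 trials and the constant 1.96 follow the text.

```python
import math

def power_confidence_interval(power_hat, n_trials=100, z=1.96):
    """Normal-approximation interval for a power estimate from n_trials trials."""
    half_width = z * math.sqrt(power_hat * (1.0 - power_hat) / n_trials)
    return power_hat - half_width, power_hat + half_width
```

For an estimated power of 0.5, the half-width is \(1.96\sqrt{0.25/100}=0.098\), the widest interval this formula can give with 100 trials; at an estimated power of 0 or 1 the interval collapses to a point.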
5.1 Simple linear experiment
In Fig. 1, the dimension of X is set to be 10. Both the number of random features in RFF and the number of inducing variables in Nyström are set to 10. We do not use the block-based method as the sample sizes are small. From Fig. 1 (right), we see that SubCorr yields the highest power as expected. HSIC and SubHSIC with Gaussian median heuristic kernels perform similarly though, with all three giving the power of 1 at the sample size of 100. On the other hand, Fig. 1 (left) shows that the two large-scale methods are still able to detect the dependence at these small sample sizes, even though there is some loss in power in comparison to HSIC and they would require a larger sample size. As we will see, this requirement for a larger sample size will be offset by a much lower computational cost in large-scale examples.
5.2 Nonlinear experiments
In this section, we consider two nonlinear experiments: one with relatively small sample sizes (up to 4000 samples) and a large-scale scenario (with the number of samples ranging from 1000 to \(10^7\)). For both experiments, we investigate the power versus time tradeoff of the introduced large-scale approximate tests and compare their performance to that of the quadratic-time ones. A summary of the methods investigated here is given in Table 1.
In addition to HSIC and its large-scale versions, we will also consider a normalisation of HSIC, dCor (Székely et al. 2007; Székely and Rizzo 2009), which can be formulated in terms of HSIC using a Brownian kernel with parameter \(h=0.5\) (Sejdinovic et al. 2013b, Appendix A). For clarity and consistency of comparison, methods using the same kernel will be compared with each other. In particular, we will compare HSIC and its large-scale approximations with GdCor (dCor using a Gaussian kernel with median heuristic bandwidth parameter) and its corresponding large-scale approximations. As the asymptotic null distribution for GdCor is unclear, a permutation approach (see Sect. 2.4.1) will be used for p-value computations. A similar comparison will be made for the Brownian kernel with \(h=0.5\).
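A permutation approach of the kind mentioned above can be sketched generically as follows. This is an illustrative sketch rather than the procedure of Sect. 2.4.1: the function names are hypothetical, the absolute Pearson correlation is only a stand-in for a statistic such as GdCor, and the "+1" correction in the p-value is one common convention assumed here.

```python
import numpy as np


def abs_correlation(x, y):
    """Stand-in test statistic (assumption): absolute Pearson correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


def permutation_p_value(statistic, x, y, num_permutations=200, seed=0):
    """Estimate a p-value by recomputing the statistic on permuted pairings.

    Permuting y breaks any dependence between the paired samples, so the
    permuted statistics approximate draws from the null distribution.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(num_permutations):
        perm = rng.permutation(len(y))
        if statistic(x, y[perm]) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the permutations.
    return (exceed + 1) / (num_permutations + 1)


rng = np.random.default_rng(1)
x = rng.normal(size=200)
y_dependent = x + 0.1 * rng.normal(size=200)
y_independent = rng.normal(size=200)

p_dependent = permutation_p_value(abs_correlation, x, y_dependent)
p_independent = permutation_p_value(abs_correlation, x, y_independent)
print(p_dependent, p_independent)
```

Each permutation requires a full recomputation of the statistic, which is why this approach becomes costly for large samples.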
Table 1 Summary of the methods compared in the nonlinear dependence experiments

| Gaussian kernel: exact \({\mathcal {O}}(m^2)\) | Gaussian kernel: approximate \({\mathcal {O}}(mn)\) | Brownian kernel: exact \({\mathcal {O}}(m^2)\) | Brownian kernel: approximate \({\mathcal {O}}(mn)\) |
| --- | --- | --- | --- |
| (G)HSIC | Block, RFF, Nyström | BHSIC | Block, RFF?, Nyström |
| GdCor | RFF, Nyström | (B)dCor | RFF?, Nyström |
5.2.1 Small-scale nonlinear experiment
From Fig. 2, dCor clearly outperforms all the other methods, with its Nyström approximation giving the closest performance in terms of power. Reassuringly, all six methods using the Gaussian kernel give very similar power. Although dCor and its large-scale approximation give superior power, the performance of HSIC and its large-scale approximations seems to be insensitive to the kernel used. Figure 3, however, tells a much more interesting story: the large-scale methods all reach a power of 1 in a test time that is several orders of magnitude smaller, demonstrating the utility of the introduced tests.
5.2.2 Large-scale nonlinear experiment
Now, we would like to compare the performance of the proposed large-scale HSIC tests with each other at sample sizes where standard HSIC/dCor approaches are no longer feasible. A Gaussian kernel with median heuristic bandwidth is used throughout this section, contrasting the block-based spectral approach, the RFF spectral approach and the Nyström spectral approach. The other approximate methods listed in Table 1 require costly permutation approaches to compute the null distribution, and we therefore do not examine them in this subsection.
Figure 4 plots the test power against the number of samples, whereas Fig. 5 plots the test power against the average testing time. In the computational time comparison plot, we also present baseline experiments which simply apply the quadratic-time HSIC spectral approach to a subset of the available data (with subset sizes in \(\{100, 200, 500, 1000, 2000, 4000\}\)). For the considered computation budgets, it is clear that this approach is unable to detect the dependence.
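The subsampling baseline can be sketched as follows: draw a random subset and compute the quadratic-time biased HSIC statistic on it. This sketch covers only the statistic, not the spectral estimation of the null distribution. The function names are hypothetical, and the median heuristic is implemented under one common convention (bandwidth equal to the median pairwise Euclidean distance), which is an assumption here.

```python
import numpy as np


def gaussian_gram(a, bandwidth=None):
    """Gaussian kernel Gram matrix; median heuristic if no bandwidth is given."""
    sq_norms = np.sum(a * a, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * a @ a.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    if bandwidth is None:
        upper = np.triu_indices(len(a), k=1)  # distinct pairs only
        bandwidth = np.sqrt(np.median(sq_dists[upper]))
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def biased_hsic(x, y):
    """Biased quadratic-time HSIC statistic, trace(Kx H Ky H) / m^2."""
    m = len(x)
    h = np.eye(m) - np.ones((m, m)) / m  # centring matrix
    kx = gaussian_gram(x)
    ky = gaussian_gram(y)
    return np.sum((h @ kx @ h) * ky) / m ** 2


def subsampled_hsic(x, y, subset_size, seed=0):
    """Apply the quadratic-time statistic to a random subset of the pairs."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=subset_size, replace=False)
    return biased_hsic(x[idx], y[idx])


rng = np.random.default_rng(1)
x = rng.normal(size=(300, 1))
y_dependent = x + 0.1 * rng.normal(size=(300, 1))
y_independent = rng.normal(size=(300, 1))

stat_dependent = subsampled_hsic(x, y_dependent, subset_size=100)
stat_independent = subsampled_hsic(x, y_independent, subset_size=100)
print(stat_dependent, stat_independent)
```

The cost and memory of this baseline scale quadratically in the subset size, so a fixed computational budget caps the amount of data it can use.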
It is clear from Fig. 4 that for both \(d=50\) and \(d=100\), the RFF method gives the best power for a fixed number of samples, followed by the Nyström method and then by the block-based approach. Although parallel computing could be employed for the block-based HSIC method, our observations indicate that the RFF and Nyström methods are preferable for achieving higher statistical power at any given sample size. The RFF method is able to achieve zero Type II error (i.e. no failure to reject a false null) with \(5\times 10^4\) samples for \(d=50\) and \(5\times 10^5\) samples for \(d=100\), while the Nyström method has an 80% false negative rate at these sample sizes. The power versus time plot in Fig. 5 gives a similar picture to Fig. 4, confirming the superiority of the RFF method on this example.
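To make the RFF approach concrete, here is a minimal sketch of an RFF-based HSIC statistic for Gaussian kernels: each sample is mapped to random cosine and sine features, and the statistic is the squared Frobenius norm of the empirical cross-covariance of the features. The function names, the fixed bandwidths (used instead of the median heuristic) and the choice of 20 features are assumptions for illustration, and the null distribution estimation is not shown.

```python
import numpy as np


def random_fourier_features(a, num_features, bandwidth, rng):
    """Random cosine/sine features approximating a Gaussian kernel."""
    if num_features % 2 != 0:
        raise ValueError("num_features must be even")
    dim = a.shape[1]
    # Frequencies drawn from the spectral measure of the Gaussian kernel.
    omega = rng.normal(scale=1.0 / bandwidth, size=(dim, num_features // 2))
    proj = a @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) * np.sqrt(2.0 / num_features)


def rff_hsic(x, y, num_features=20, bandwidth_x=1.0, bandwidth_y=1.0, seed=0):
    """Squared Frobenius norm of the feature cross-covariance matrix."""
    rng = np.random.default_rng(seed)
    zx = random_fourier_features(x, num_features, bandwidth_x, rng)
    zy = random_fourier_features(y, num_features, bandwidth_y, rng)
    zx = zx - zx.mean(axis=0)  # centring the features replaces H K H
    zy = zy - zy.mean(axis=0)
    cross_cov = zx.T @ zy / len(x)
    return np.sum(cross_cov ** 2)


rng = np.random.default_rng(1)
x = rng.normal(size=(2000, 1))
y_dependent = x + 0.1 * rng.normal(size=(2000, 1))
y_independent = rng.normal(size=(2000, 1))

stat_dependent = rff_hsic(x, y_dependent)
stat_independent = rff_hsic(x, y_independent)
print(stat_dependent, stat_independent)
```

No \(m \times m\) Gram matrix is ever formed in this sketch, so memory grows linearly in the number of samples.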
5.2.3 Real data experiment
6 Discussion and conclusions
We have proposed three large-scale estimators of HSIC, a kernel-based nonparametric dependence measure: the block-based estimator, the Nyström estimator and the RFF estimator. We subsequently established suitable independence testing procedures for each method, by taking advantage of the normal asymptotic null distribution of the block-based estimator and by employing an approach that directly estimates the eigenvalues appearing in the asymptotic null distribution for the Nyström and RFF methods. All three tests significantly reduce computational complexity in memory and time over the standard HSIC-based test. We verified the validity of our large-scale testing methods and their favourable tradeoffs between testing power and computational complexity on challenging high-dimensional synthetic data. We have observed that the RFF and Nyström approaches have considerable advantages over the block-based test. Several further extensions can be studied: the developed large-scale approximations are readily applicable to three-variable interaction testing (Sejdinovic et al. 2013a), conditional independence testing (Fukumizu et al. 2008) and causal discovery (Zhang et al. 2011; Flaxman et al. 2015). Moreover, the RFF HSIC approach can be extended using the additional smoothing of characteristic function representations, similarly to the approach of Chwialkowski et al. (2015) in the context of two-sample testing. Furthermore, one can also consider the robustness of the proposed approaches in the case where heterogeneous datasets are considered, or when some form of within-sample dependence is introduced, e.g. when testing for independence between time series (Chwialkowski and Gretton 2014).
Footnotes
1. An alternative approach, applicable to scalar variables, combines kernel-based methods with a copula transformation (Póczos et al. 2012) and can be used to tabulate the null distribution, but this approach is not straightforward to generalise to modelling dependence between random vectors.
2. A straightforward estimator of dCor (Székely et al. 2007; Székely and Rizzo 2009) is then given by normalising \(\varXi _b({\mathbf {z}})\) by the Frobenius norms of \(HK_xH\) and \(HK_yH\), i.e. \(\widehat{dCor}(\mathbf{z}) = \frac{\langle HK_xH, HK_yH \rangle }{\Vert HK_xH\Vert _F\Vert HK_yH\Vert _F}\).
3. For example, \(B=m^\delta \ {\text {with }} \delta \in (0,1)\).
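The normalisation in footnote 2 can be sketched directly from the two Gram matrices. The function names below are hypothetical, and the linear kernel in the usage example is chosen only to keep the example short; it is not one of the kernels used in the experiments.

```python
import numpy as np


def centred(k):
    """Return H K H for the centring matrix H = I - (1/m) 1 1^T."""
    m = k.shape[0]
    h = np.eye(m) - np.ones((m, m)) / m
    return h @ k @ h


def kernel_dcor(kx, ky):
    """Frobenius inner product of the centred Gram matrices, normalised
    by the product of their Frobenius norms."""
    ckx, cky = centred(kx), centred(ky)
    return np.sum(ckx * cky) / (np.linalg.norm(ckx) * np.linalg.norm(cky))


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = rng.normal(size=(50, 2))
kx = x @ x.T  # linear kernel, for illustration only
ky = y @ y.T

value_identical = kernel_dcor(kx, 4.0 * kx)  # rescaling leaves the ratio unchanged
value_independent = kernel_dcor(kx, ky)
print(value_identical, value_independent)
```

By the Cauchy–Schwarz inequality the ratio is at most 1 in absolute value, and it is unaffected by rescaling either Gram matrix.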
References
Anderson, N.H., Hall, P., Titterington, D.M.: Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Multivar. Anal. 50, 41–54 (1994)
Arcones, M.A., Gine, E.: On the bootstrap of \(U\) and \(V\) statistics. Ann. Stat. 20(2), 655–674 (1992)
Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68(3), 337–404 (1950)
Bach, F., Jordan, M.I.: Kernel independent component analysis. J. Mach. Learn. 10, 1–48 (2002)
Berlinet, A., Thomas-Agnan, C.: Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, Boston (2004)
Bertin-Mahieux, T., Ellis, D.P., Whitman, B., Lamere, P.: The million song dataset. In: International Conference on Music Information Retrieval (ISMIR) (2011)
Blaschko, M., Gretton, A.: Learning taxonomies by dependence maximization. Adv. Neural Inf. Process. Syst. 21, 153–160 (2009)
Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Schölkopf, B., Smola, A.J.: Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22(14), e49–e57 (2006). doi: 10.1093/bioinformatics/btl242
Cho, Y., Saul, L.K.: Kernel methods for deep learning. In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 22, pp. 342–350. Curran Associates Inc., Red Hook (2009)
Chwialkowski, K., Gretton, A.: A kernel independence test for random processes. In: Proceedings of the 31st International Conference on Machine Learning (2014)
Chwialkowski, K., Ramdas, A., Sejdinovic, D., Gretton, A.: Fast two-sample testing with analytic representations of probability measures. In: Advances in Neural Information Processing Systems (NIPS), vol. 28 (2015)
Cortes, C., Mohri, M., Rostamizadeh, A.: Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res. 13(1), 795–828 (2012)
Dauxois, J., Nkiet, G.M.: Nonlinear canonical analysis and independence tests. Ann. Stat. 26(4), 1254–1278 (1998)
Flaxman, S.R., Neill, D.B., Smola, A.J.: Gaussian processes for independence tests with non-iid data in causal inference. ACM Trans. Intell. Syst. Technol. 7(2), 22:1–22:23 (2015)
Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B.: Kernel measures of conditional dependence. In: Adv. NIPS (2008)
Gretton, A., Bousquet, O., Smola, A., Schölkopf, B.: Measuring Statistical Dependence with Hilbert–Schmidt Norms. Lecture Notes in Computer Science, pp. 63–77 (2005)
Gretton, A., Fukumizu, K., Schölkopf, B., Teo, C.H., Song, L., Smola, A.J.: A kernel statistical test of independence. In: NIPS (2008)
Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.: A fast, consistent kernel two-sample test. In: Advances in Neural Information Processing Systems, vol. 22. Curran Associates Inc., Red Hook, pp. 673–681 (2009). papers/1006_paperlong.pdf
Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K.: Optimal kernel choice for large-scale two-sample tests. In: Advances in Neural Information Processing Systems (2012a)
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012b)
Huang, S.Y., Lee, M.H., Hsiao, C.K.: Nonlinear measures of association with kernel canonical correlation analysis and applications. J. Stat. Plan. Inference 139(7), 2162–2174 (2009)
Jitkrittum, W., Szabo, Z., Gretton, A.: An Adaptive Test of Independence with Analytic Kernel Embeddings (2016)
Lai, P., Fyfe, C.: Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst. 10(5), 365–377 (2000)
Lopez-Paz, D.: From Dependence to Causation. Ph.D. thesis, University of Cambridge (2016)
Lopez-Paz, D., Hennig, P., Schölkopf, B.: The randomized dependence coefficient. Adv. Neural Inf. Process. Syst. 26, 1–9 (2013)
Lopez-Paz, D., Sra, S., Smola, A., Ghahramani, Z., Schölkopf, B.: Randomized nonlinear component analysis. In: Proceedings of the 31st International Conference on Machine Learning, pp. 1359–1367 (2014)
Lyons, R.: Distance covariance in metric spaces. Ann. Probab. 41(5), 3284–3305 (2013)
Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, New York (1979)
Nguyen, D., Eisenstein, J.: A Kernel Independence Test for Geographical Language Variation (2016). ArXiv:1601.06579
Peters, J., Mooij, J.M., Janzing, D., Schölkopf, B.: Causal discovery with continuous additive noise models. J. Mach. Learn. Res. 15(1), 2009–2053 (2014)
Póczos, B., Ghahramani, Z., Schneider, J.G.: Copula-based kernel dependency measures. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26–July 1, 2012 (2012)
Rahimi, A., Recht, B.: Random Features for Large-Scale Kernel Machines. Adv. Neural Inf. Process. Syst. 20 (2007)
Reed, M., Simon, B.: Methods of Modern Mathematical Physics. I: Functional Analysis, 2nd edn. Academic Press, New York (1980)
Rubenstein, P.K., Chwialkowski, K.P., Gretton, A.: A kernel test for three-variable interactions with random processes. arXiv preprint (2015). ArXiv:1603.00929
Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2002)
Sejdinovic, D., Gretton, A., Bergsma, W.: A kernel test for three-variable interactions. Adv. Neural Inf. Process. Syst. (NIPS) 26, 1124–1132 (2013a)
Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013b)
Sejdinovic, D., Strathmann, H., De, S., Zaremba, W., Blaschko, M., Gretton, A.: Big Hypothesis Tests with Kernel Embeddings: An Overview. Technical Report (2014). Gatsby Unit, UCL
Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, New York (2002)
Smola, A., Gretton, A., Song, L., Schölkopf, B.: A Hilbert space embedding for distributions. In: Algorithmic Learning Theory: 18th International Conference, pp. 13–31 (2007)
Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems, vol. 18, pp. 1257–1264. MIT Press (2006)
Song, L., Smola, A., Gretton, A., Borgwardt, K.M.: A dependence maximization view of clustering. In: Proceedings of the 24th International Conference on Machine Learning, pp. 815–822 (2007)
Song, L., Smola, A., Gretton, A., Bedo, J., Borgwardt, K.: Feature selection via dependence maximization. J. Mach. Learn. Res. 13, 1393–1434 (2012). http://jmlr.csail.mit.edu/papers/v13/song12a.html
Sriperumbudur, B.K.: Reproducing Kernel Space Embeddings and Metrics on Probability Measures. PhD Thesis, University of California, San Diego (2010)
Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
Sutherland, D.J., Schneider, J.: On the error of random Fourier features. In: Conference on Uncertainty in Artificial Intelligence (UAI) (2015)
Székely, G.J., Rizzo, M.L.: Brownian distance covariance. Ann. Appl. Stat. 3(4), 1236–1265 (2009)
Székely, G.J., Rizzo, M.L., Bakirov, N.K.: Measuring and testing dependence by correlation of distances. Ann. Stat. 35(6), 2769–2794 (2007)
Wendland, H.: Scattered Data Approximation. Cambridge University Press, Cambridge (2005)
Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Leen, T., Dietterich, T., Tresp, V. (eds.) Advances in Neural Information Processing Systems, vol. 13, pp. 682–688. MIT Press (2001). http://papers.nips.cc/paper/1866usingthenystrommethodtospeedupkernelmachines
Zaremba, A., Aste, T.: Measures of causality in complex datasets with application to financial data. Entropy 16(4), 2309 (2014)
Zaremba, W., Gretton, A., Blaschko, M.: B-test: a nonparametric, low variance kernel two-sample test. In: Advances in Neural Information Processing Systems (2013)
Zhang, K., Peters, J., Janzing, D., Schölkopf, B.: Kernel-based conditional independence test and application in causal discovery. In: Uncertainty in Artificial Intelligence, pp. 804–813 (2011)
Zhao, J., Meng, D.: FastMMD: ensemble of circular discrepancy for efficient two-sample test. Neural Comput. 27(6), 1345–1372 (2015)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.