Analysis of Conditional Randomisation and Permutation schemes with application to conditional independence testing

We study properties of two resampling scenarios: Conditional Randomisation and Conditional Permutation schemes, which are relevant for testing conditional independence of discrete random variables $X$ and $Y$ given a random variable $Z$. Namely, we investigate asymptotic behaviour of estimates of a vector of probabilities in such settings, establish their asymptotic normality and ordering between asymptotic covariance matrices. The results are used to derive asymptotic distributions of the empirical Conditional Mutual Information in those set-ups. Somewhat unexpectedly, the distributions coincide for the two scenarios, despite differences in the asymptotic distributions of the estimates of probabilities. We also prove validity of permutation p-values for the Conditional Permutation scheme. The above results justify consideration of conditional independence tests based on resampled p-values and on the asymptotic chi-square distribution with an adjusted number of degrees of freedom. We show in numerical experiments that when the ratio of the sample size to the number of possible values of the triple exceeds 0.5, the test based on the asymptotic distribution with the adjustment made on a limited number of permutations is a viable alternative to the exact test for both the Conditional Permutation and the Conditional Randomisation scenarios. Moreover, there is no significant difference between the performance of exact tests for Conditional Permutation and Randomisation schemes, the latter requiring knowledge of conditional distribution of $X$ given $Z$, and the same conclusion is true for both adaptive tests.


Introduction
Checking for conditional independence is a crucial ingredient of many Machine Learning algorithms, such as those designed to learn the structure of graphical models or to select active predictors for the response in a regression task, see e.g. [1][2][3]. In a greedy approach to variable selection for the response, one needs to verify whether a predictor X is conditionally independent of the response, say, Y, given Z (denoted by X ⊥⊥ Y | Z), where Z is a vector of predictors already chosen as active and X is any of the remaining candidates. When conditional independence holds, X is deemed irrelevant; when the test fails, the candidate that 'most strongly' contradicts it is chosen.
Verification of conditional independence of discrete-valued random variables uses a specially designed test statistic, say T, such as the Pearson chi-square statistic χ² or Conditional Mutual Information (CMI). The value of the statistic, calculated for the data at hand, is compared with a benchmark distribution. Usually, as the benchmark distribution one uses either the asymptotic distribution of T under conditional independence, or its distribution (or an approximation thereof) obtained from resampled samples which conform to conditional independence. More often than not, the asymptotic test is too liberal, especially for small sample sizes, which leads to the acceptance of too many false positive predictors. That is why resampling methods are of interest in this context (for other approaches see e.g. [4][5][6] and references therein). The resampling is commonly performed either by permuting the values of X on each stratum of Z, see e.g. [7], or by replacing the original values of X by values generated according to the conditional distribution $P_{X|Z}$ if this distribution is known (we refer to the former as Conditional Permutation and to the latter as Conditional Randomisation, [4]). Although the validity of the resampling approach in the latter case can be established fairly easily (see ibidem), it was previously unknown for the conditional permutation approach, as well as for the asymptotic approach in both settings. Based on the proved asymptotic results, we propose a modified asymptotic test that uses a χ² distribution with an adjusted number of degrees of freedom as the benchmark distribution. The major contributions of the paper are thus as follows: we (i) establish the validity of the resampling method for the conditional permutation approach; (ii) derive the asymptotic distributions of the estimated vector of probabilities and of the estimator of CMI under both resampling scenarios; (iii) compare the asymptotic and resampled p-values approaches in numerical experiments. In numerical experiments, we show
that for the models considered, when the ratio of the sample size to the size of the support of (X, Y, Z) is larger than 0.5, the test based on the asymptotic distribution with adjustments based on a limited number of permutations performs equally well as or better than the exact test for both the Conditional Permutation and the Conditional Randomisation scenarios. Moreover, there is no significant difference in the performance of the exact tests for the Conditional Permutation and Conditional Randomisation schemes, the latter requiring knowledge of the conditional distribution of X given Z. The same is true for both adaptive tests.
As the null hypothesis of conditional independence is composite, an important question arises: how to control the type I error by choosing adequate conditionally independent probability structures.In the paper, we adopt a novel approach to address this issue, which involves investigating those null distributions that are Kullback-Leibler projections of probability distributions for which power is investigated.
An important by-product of the investigation in (i) is that we establish asymptotic normality of the normalised and centered vector having a multivariate hypergeometric or generalized hypergeometric distribution under the conditional permutation scheme.

Preliminaries
We consider a discrete-valued triple (X, Y, Z), where X ∈ 𝒳, Y ∈ 𝒴, Z ∈ 𝒵, and all variables are possibly multivariate. Assume that P(X = x, Y = y, Z = z) = p(x, y, z) > 0 holds for any (x, y, z) ∈ 𝒳 × 𝒴 × 𝒵. Moreover, we let p(x, y|z) = P(X = x, Y = y|Z = z) and p(z) = P(Z = z), and define p(x|z) and p(y|z) analogously. We will denote by I, J, K the respective sizes of the supports of X, Y and Z: |𝒳| = I, |𝒴| = J, |𝒵| = K. As our aim is to check conditional independence, we will use Conditional Mutual Information (CMI) as a measure of conditional dependence (we refer to [8] for basic information-theoretic concepts such as entropy and mutual information). Conditional Mutual Information is a non-negative number defined as
$$I(Y; X|Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)}. \qquad (1)$$
We stress that the conditional mutual information is the mutual information (MI) of Y and X given Z = z, defined as the mutual information between $P_{YX|Z=z}$ and the product of $P_{Y|Z=z}$ and $P_{X|Z=z}$, averaged over the values of Z. As MI is the Kullback-Leibler divergence between the joint and the product distribution, it follows from the properties of the Kullback-Leibler divergence that I(Y; X|Z) = 0 ⟺ X and Y are conditionally independent given Z.
This is a powerful property, not satisfied by other measures of dependence, such as the partial correlation coefficient in the case of continuous random variables. The conditional independence of X and Y given Z will be denoted by X ⊥⊥ Y | Z and referred to as CI. We note that since I(Y; X|Z) is defined as a probabilistic average of I(Y; X|Z = z) over Z = z, it follows that I(Y; X|Z) = 0 ⟺ I(Y; X|Z = z) = 0 for every z in the support of Z; this is due to (1). Let $(X_i, Y_i, Z_i)_{i=1}^n$ be an independent sample of copies of (X, Y, Z) and consider the unconstrained maximum likelihood estimator of the probability mass function (p.m.f.) $(p(x,y,z))_{x,y,z}$ based on this sample, which is simply the vector of fractions $(\hat p(x,y,z))_{x,y,z} = (n(x,y,z)/n)_{x,y,z}$, where $n(x,y,z) = \sum_{i=1}^n I\{X_i = x, Y_i = y, Z_i = z\}$. In the following, we will examine several resampling schemes that involve generating new data satisfying the CI hypothesis given the fixed original sample. Extending the observed data to an infinite sequence, we will denote by P* the conditional probability related to the resampling schemes considered, given the sequence $(X_i, Y_i, Z_i)_{i=1}^\infty$.

Resampling scenarios
We first discuss the Conditional Permutation scheme, which can be applied to conditional independence testing. We then establish the validity of p-values based on this scheme and the form of the asymptotic distribution of the sample proportions, which is later used to derive the asymptotic distribution of the empirical CMI.

Conditional Permutation (CP) scenario
We assume that the sample $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = (X_i, Y_i, Z_i)_{i=1}^n$ is given and we consider the CI hypothesis H₀: X ⊥⊥ Y | Z. The Conditional Permutation (CP) scheme, used e.g. in [7], is a generalisation of the usual permutation scenario applied to test unconditional independence of X and Y. It consists in the following: for every value $z_k$ of Z appearing in the sample, we consider the stratum corresponding to this value, namely $P_k = \{i : Z_i = z_k\}$. The CP sample is obtained from the original sample by replacing $(X_i)_{i \in P_k}$ with $(X_{\pi_k(i)})_{i \in P_k}$, where $\pi_k$ is a randomly and uniformly chosen permutation of $P_k$ and the $\pi_k$ are independent (see Algorithm 1). Thus on every stratum Z = z, we randomly permute the values of the corresponding X independently of the values of Y. It is, in fact, sufficient to permute only the values of X to ensure conditional independence; this follows from the fact that for any discrete random vector (X, Y) the variable X is independent of σ(Y), where σ is a randomly and uniformly chosen permutation of the values of Y such that σ ⊥⊥ (X, Y). The pseudo-code of the algorithm is given below. We consider the family Π of all permutations π of {1, ..., n} which preserve each of the $P_k$, i.e. π is composed of the $\pi_k$'s, so that its restriction to every $P_k$ is a permutation of $P_k$. The number of such permutations is $|\Pi| = \prod_k |P_k|!$.
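The stratum-wise permutation above can be sketched as follows (a minimal illustration, not the paper's code; variable names are ours):

```python
import numpy as np

def conditional_permutation(X, Z, rng):
    """Permute the values of X uniformly at random within each stratum Z = z,
    leaving Y untouched; by construction the result satisfies X* independent
    of Y given Z."""
    X_star = np.array(X, copy=True)
    for z in np.unique(Z):
        idx = np.flatnonzero(Z == z)                 # stratum P_k = {i : Z_i = z_k}
        X_star[idx] = X_star[rng.permutation(idx)]   # independent uniform pi_k
    return X_star

rng = np.random.default_rng(0)
X = np.array([0, 1, 0, 1, 1, 0])
Z = np.array([0, 0, 0, 1, 1, 1])
X_star = conditional_permutation(X, Z, rng)
# Within each stratum, the multiset of X values is preserved.
```

Note that only X is permuted; the pairs (Y_i, Z_i) stay fixed, in line with the remark above.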

Validity of p-values for CP scenario
We first prove a result which establishes the validity of resampled p-values for any statistic under the Conditional Permutation scheme. Let $T = T(\mathbf{X}_n, \mathbf{Y}_n, \mathbf{Z}_n)$ be the value of the test statistic on the original sample and let $T^*_b$, $b = 1, \ldots, B$, be its values on B independent CP samples. The pertaining p-value based on CP resampling is defined as
$$\hat{p} = \frac{1 + \sum_{b=1}^{B} I\{T^*_b \ge T\}}{B + 1}.$$
Thus, up to the ones added to the numerator and the denominator, the resampling p-value is the fraction of the $T^*_b$ not smaller than T (the ones are added to avoid null p-values). Although p-values based on the CP scheme have been used in practice (see e.g. [7]), to the best of our knowledge their validity has not been established previously.
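The p-value above is elementary to compute; a minimal sketch (the function name is ours):

```python
def resampling_p_value(T, T_star):
    """Resampled p-value (1 + #{b : T*_b >= T}) / (B + 1); adding one to the
    numerator and the denominator avoids null p-values."""
    return (1.0 + sum(t >= T for t in T_star)) / (len(T_star) + 1.0)

# Two of the three resampled statistics are >= 2.0, so p-hat = (1 + 2) / 4.
p_hat = resampling_p_value(2.0, [1.0, 3.0, 2.0])   # -> 0.75
```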
Theorem 1 (Validity of p-values for CP scheme) If the null hypothesis H₀: X ⊥⊥ Y | Z holds, then $P(\hat{p} \le \alpha) \le \alpha$ for every $\alpha \in [0, 1]$, where $T = T(\mathbf{X}_n, \mathbf{Y}_n, \mathbf{Z}_n)$ and $\hat p$ is the resampling p-value defined above. The result implies that if the testing procedure rejects H₀ when the resampling p-value does not exceed α, its level of significance is also controlled at α. The proof is based on the exchangeability of $T, T^*_1, \ldots, T^*_B$ and is given in the Appendix.

Asymptotic distribution of sample proportions for Conditional Permutation method
We define p* to be the empirical p.m.f. based on the sample (X*, Y, Z): $p^*(x,y,z) = n^{-1}\sum_{i=1}^n I\{X^*_i = x, Y_i = y, Z_i = z\}$, where the underlying π ∈ Π is randomly and uniformly chosen from Π. Similarly to n(x, y, z) we let $n(y,z) = \sum_{i=1}^n I\{Y_i = y, Z_i = z\}$, and n(x, z) is defined analogously. We first prove
Theorem 2 (i) The joint distribution of the vector $(np^*(x,y,z))_{x,y,z}$ given $(X_i, Y_i, Z_i)_{i=1}^n$ is as follows:
$$P^*\big(np^*(x,y,z) = k(x,y,z)\ \text{for all } (x,y,z)\big) = \prod_z \frac{\prod_x n(x,z)!\ \prod_y n(y,z)!}{n(z)!\ \prod_{x,y} k(x,y,z)!}, \qquad (2)$$
where $(k(x,y,z))_{x,y,z}$ is a sequence taking values in the nonnegative integers such that $\sum_x k(x,y,z) = n(y,z)$ and $\sum_y k(x,y,z) = n(x,z)$.
(ii) Given $(X_i, Y_i, Z_i)_{i=1}^\infty$, the following weak convergence holds:
$$\sqrt{n}\,\big(p^*(x,y,z) - \hat p(x|z)\hat p(y|z)\hat p(z)\big)_{x,y,z} \xrightarrow{d} N(0, \Sigma) \qquad (3)$$
for almost all $(X_i, Y_i, Z_i)_{i=1}^\infty$, where $\Sigma^{x',y',z'}_{x,y,z}$, the element of Σ corresponding to row index (x, y, z) and column index (x', y', z'), is defined by
$$\Sigma^{x',y',z'}_{x,y,z} = \delta_{zz'}\, p(z)\, p(x|z)\big(\delta_{xx'} - p(x'|z)\big)\, p(y|z)\big(\delta_{yy'} - p(y'|z)\big). \qquad (4)$$
We stress that (2) is a deterministic equality describing the distribution of np*: for $(k(x,y,z))_{x,y,z}$ such that $\sum_x k(x,y,z) = n(y,z)$ and $\sum_y k(x,y,z) = n(x,z)$ (where n(x, z) and n(y, z) are based on the original sample) the corresponding value of the p.m.f. is given by (2); otherwise it is 0.
Proof (i) The proof is a simple generalisation of the result of J. Halton [9], who established the form of the conditional distribution of a bivariate contingency table given its marginals, and we omit it. (ii) In view of (2), the subvectors corresponding to different strata Z = z are conditionally independent; thus, in order to prove (3), it is sufficient to prove the analogous result on a single stratum Z = z, i.e. for the unconditional permutation scenario. Note that since we consider a conditional result given $(X_i, Y_i, Z_i)_{i=1}^\infty$, the strata sample sizes $n(z_i)$ are deterministic and such that $n(z_i)/n \to P(Z = z_i)$ for almost every such sequence. The needed result is stated below.
Theorem 3 Assume that $n_{ij}$, i = 1, ..., I, j = 1, ..., J, are the elements of an I × J contingency table based on an iid sample of n observations from a discrete distribution $(p_{ij})$ satisfying $p_{ij} = p_{i\cdot}\, p_{\cdot j}$. Then, provided $p_{ij} > 0$ for all i, j, we have, conditionally on the margins,
$$\sqrt{n}\,\big(n_{ij}/n - n_{i\cdot}n_{\cdot j}/n^2\big)_{i,j} \xrightarrow{d} N(0, \Sigma), \qquad (5)$$
where $\Sigma = (\Sigma^{k,l}_{i,j})$ and $\Sigma^{k,l}_{i,j} = p_{i\cdot}(\delta_{ik} - p_{k\cdot})\, p_{\cdot j}(\delta_{jl} - p_{\cdot l})$.
Remark 1 Let $(X^*_i, Y_i)_{i=1}^n$ be a sample obtained from $(X_i, Y_i)_{i=1}^n$ by a random (unconditional) permutation of the values of $X_i$, and let $p^*(x, y)$ be the empirical p.m.f. corresponding to $(X^*_i, Y_i)_{i=1}^n$. Then obviously $(n_{ij}/n)$ and $(p^*(x, y))$ follow the same distribution, and (5) is equivalent to the analogous convergence for $(p^*(x,y))_{x,y}$. Moreover, the elements of Σ can be written as (compare (4)) $\Sigma^{x',y'}_{x,y} = p(x)(\delta_{xx'} - p(x'))\, p(y)(\delta_{yy'} - p(y'))$.
Remark 2 The matrix Σ introduced above has rank (I − 1)(J − 1) and can be written using tensor products as $(\mathrm{diag}(\alpha) - \alpha \otimes \alpha) \otimes (\mathrm{diag}(\beta) - \beta \otimes \beta)$, where $\alpha = (p_{i\cdot})_i$ and $\beta = (p_{\cdot j})_j$.
The proof of Theorem 3 follows from a weak convergence result for table-valued hypergeometric distributions, which is important in its own right.
Let R denote the range of indices (i, j): R = {1, ..., I} × {1, ..., J}. For sequences of margins $a_r = (a^{(r)}_i)_i$ and $b_r = (b^{(r)}_j)_j$ with $a^{(r)}_i/r \to \alpha_i > 0$ and $b^{(r)}_j/r \to \beta_j > 0$, suppose that the law of $W_r = (W^{(r)}_{ij})_{(i,j)\in R}$ is the generalized hypergeometric law given by (6). Then
$$\sqrt{r}\,\big(W^{(r)}_{ij}/r - a^{(r)}_i b^{(r)}_j/r^2\big)_{(i,j)\in R} \xrightarrow{d} N(0, \Sigma),$$
where $\Sigma = (\Sigma^{k,l}_{i,j})$ with $\Sigma^{k,l}_{i,j} = \alpha_i(\delta_{ik} - \alpha_k)\,\beta_j(\delta_{jl} - \beta_l)$. The proof of Lemma 4 is relegated to the Appendix. Theorem 3 is a special case of Lemma 4 with $a_r = (n_{i\cdot})_i$, $b_r = (n_{\cdot j})_j$, r = n, on a probability space (Ω, F, $P_n$), where $P_n(\cdot) = P(\cdot \mid (n_{i\cdot})_i, (n_{\cdot j})_j)$ is a regular conditional probability.

Conditional Randomisation scenario
We now consider the Conditional Randomisation (CR) scheme, popularised in [4]. This scheme assumes that the conditional distribution $P_{X|Z}$ is known, and the resampled sample is $(X^*_i, Y_i, Z_i)_{i=1}^n$, where $X^*_i$ is generated according to the conditional distribution $P_{X|Z=z_i}$, independently over i and independently of (X, Y).
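A minimal sketch of the CR draw (the dictionary interface for the known conditional law is an assumption made for illustration):

```python
import numpy as np

def conditional_randomisation(Z, p_x_given_z, rng):
    """Draw each X*_i from the known conditional law P(X | Z = z_i),
    independently of everything else.

    p_x_given_z: dict mapping a value z to a probability vector over the
    values of X (a hypothetical representation of P_{X|Z})."""
    return np.array([rng.choice(len(p_x_given_z[z]), p=p_x_given_z[z]) for z in Z])

rng = np.random.default_rng(1)
p_x_given_z = {0: [0.2, 0.8], 1: [1.0, 0.0]}   # toy conditional laws
Z = np.array([0, 1, 1, 0])
X_star = conditional_randomisation(Z, p_x_given_z, rng)
```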
The assumption that $P_{X|Z}$ is known is frequently made (see e.g. [4] or [10]) and is realistic in situations when a large database of unlabelled observations (X, Z) is available, upon which an accurate approximation of $P_{X|Z}$ can be based. Theorem 4 in [10] justifies the robustness of the type I error of the corresponding testing procedure. We note that the conclusion of Theorem 1 is also valid for the CR scenario (cf. [4], Lemma 4.1). Let $p^*(x, y, z)$ denote the empirical p.m.f. based on the CR sample.
Theorem 5 Given $(X_i, Y_i, Z_i)_{i=1}^\infty$, for almost all such sequences,
$$\sqrt{n}\,\big(p^*(x,y,z) - \hat p(y,z)\,p(x|z)\big)_{x,y,z} \xrightarrow{d} N(0, \tilde\Sigma),$$
where $\tilde\Sigma^{x',y',z'}_{x,y,z} = \delta_{yy'}\delta_{zz'}\, p(y,z)\, p(x|z)\big(\delta_{xx'} - p(x'|z)\big)$. The proof, which is based on the multivariate Berry-Esseen theorem, is moved to the Appendix. The difference $\tilde\Sigma - \Sigma$ of the asymptotic covariance matrices for the CR and CP schemes is a nonnegative definite matrix (see Lemma 6 in the Appendix). The inequalities between the covariance matrices can be strict. In view of this, it is somewhat surprising that the asymptotic distributions of CMI based on p* in all resampling scenarios coincide. This is investigated in the next Section.

Asymptotic distribution of CM I for considered resampling schemes
We consider CMI as a functional of the probability vector $(p(x,y,z))_{x,y,z}$, defined as (compare (1))
$$CMI(p) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x,y,z)\,p(z)}{p(x,z)\,p(y,z)},$$
where p(z), p(x, z) and p(y, z) denote the corresponding margins of p.
We prove that despite the differences in the asymptotic behaviour of $n^{1/2}(p^* - p)$ for the two resampling schemes considered, the asymptotic distributions of $2n\,CMI(p^*)$ based on them coincide. Moreover, the common limit coincides with the asymptotic distribution of $2n\,\widehat{CMI}$, namely the χ² distribution with (|𝒳| − 1) × (|𝒴| − 1) × |𝒵| degrees of freedom. Thus in this case the general bootstrap principle holds, as the asymptotic distributions of $\widehat{CMI}$ and $CMI^*$ are the same.
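The functional above is computed directly from a joint p.m.f.; a minimal sketch for a p.m.f. stored as an array `p[x, y, z]` (an assumed representation):

```python
import numpy as np

def cmi(p):
    """CMI(p) = sum_{x,y,z} p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]
    for a joint p.m.f. stored as an array p[x, y, z]."""
    p = np.asarray(p, dtype=float)
    pz  = p.sum(axis=(0, 1))   # p(z)
    pxz = p.sum(axis=1)        # p(x, z)
    pyz = p.sum(axis=0)        # p(y, z)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = p * pz[None, None, :] / (pxz[:, None, :] * pyz[None, :, :])
    mask = p > 0               # cells with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(ratio[mask])))

# A conditionally independent p.m.f. p(x|z) p(y|z) p(z) has CMI equal to 0:
pz   = np.array([0.5, 0.5])
px_z = np.array([[0.3, 0.6], [0.7, 0.4]])   # px_z[x, z] = p(x|z)
py_z = np.array([[0.1, 0.5], [0.9, 0.5]])   # py_z[y, z] = p(y|z)
p_ci_example = px_z[:, None, :] * py_z[None, :, :] * pz[None, None, :]
```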
Theorem 6 For almost all sequences $(X_i, Y_i, Z_i)$, i = 1, 2, ..., and conditionally on them, $2n\,CMI(p^*) \xrightarrow{d} \chi^2_{(I-1)(J-1)K}$ a.e., where p* is based on the CP or CR scheme.
Proof We will prove the result for the Conditional Permutation scheme and indicate the differences for the CR scheme at the end. The approach is based on the delta method, as in the case of $\widehat{CMI}$ (see e.g. [6]). The gradient and Hessian of CMI(p), considered as a function of p, are equal to, respectively,
$$\big(D_{CMI}(p)\big)_{x,y,z} = \log \frac{p(x,y,z)\,p(z)}{p(x,z)\,p(y,z)}$$
and
$$\big(H_{CMI}(p)\big)^{x',y',z'}_{x,y,z} = \delta_{zz'}\left(\frac{\delta_{xx'}\delta_{yy'}}{p(x,y,z)} + \frac{1}{p(z)} - \frac{\delta_{xx'}}{p(x,z)} - \frac{\delta_{yy'}}{p(y,z)}\right),$$
where $(H_{CMI}(p))^{x',y',z'}_{x,y,z}$ denotes the element of the Hessian with row index (x, y, z) and column index (x', y', z'). In order to check this, note that e.g. the term $p(x',z') = \sum_{y'} p(x',y',z')$ contains the summand p(x, y, z) if x = x' and z = z', and thus $\partial p(x',z')/\partial p(x,y,z) = I(x = x', z = z')$. The proof now follows from expanding $CMI(p^*)$ around $\hat p_{ci} := \hat p(x|z)\hat p(y|z)\hat p(z)$:
$$CMI(p^*) = CMI(\hat p_{ci}) + D_{CMI}(\hat p_{ci})^T(p^* - \hat p_{ci}) + \tfrac{1}{2}(p^* - \hat p_{ci})^T H_{CMI}(\xi)(p^* - \hat p_{ci}), \qquad (11)$$
where $\xi = (\xi_{x,y,z})_{x,y,z}$ and $\xi_{x,y,z}$ is a point in-between $p^*(x,y,z)$ and $\hat p_{ci}(x,y,z)$. We note that $CMI(\hat p_{ci}) = 0$, as $\hat p_{ci}$ is a distribution satisfying CI, and, moreover, the gradient $D_{CMI}$ at $\hat p_{ci}$ is also 0, as the logarithm above vanishes for any p.m.f. satisfying CI. Thus the first two terms on the RHS of (11) are 0. Moreover, using the continuity of $H_{CMI}(\cdot)$, which follows from p(x, y, z) > 0 for all (x, y, z), and (3), it is easy to see that
$$2n\,CMI(p^*) \xrightarrow{d} \sum_{x,y,z} \lambda_{x,y,z}\, Z^2_{x,y,z},$$
where $Z = (Z_{x,y,z})_{x,y,z} \sim N(0, I)$ and the $\lambda_{x,y,z}$ are the eigenvalues of the matrix $M = H_{CMI}(p_{ci})\Sigma$. To finish the proof, it is enough to check that M is idempotent, so that all its eigenvalues are 0 or 1, and to verify that the trace of M equals (|𝒳| − 1) × (|𝒴| − 1) × |𝒵|. This is proved in Lemma 3 in the Appendix. The proof for the CR scheme is analogous and differs only in that, in the final part of the proof, the matrix M is replaced by $\tilde M = H_{CMI}\tilde\Sigma$, where $\tilde\Sigma$ is defined in Theorem 5. However, it is shown in Lemma 3 in the Appendix that $\tilde M = M$, thus the conclusion of the Theorem holds also for the CR scheme.

Remark 4
We note that two additional resampling scenarios can be defined. The first one, which we call bootstrap.X, is a variant of the CR scenario in which, instead of sampling on the stratum Z = z_i from the distribution $P_{X|Z=z_i}$, the pseudo-observations are sampled from the empirical distribution $\hat p(x|z_i)$. In order to introduce the second proposal, Conditional Independence Bootstrap (CIB), consider first the empirical distribution $\hat p_{ci} = \hat p(x|z)\hat p(y|z)\hat p(z)$. We note that the probability mass function $(\hat p_{ci}(x,y,z))_{x,y,z}$ is the maximum likelihood estimator of the p.m.f. $(p(x,y,z))_{x,y,z}$ when conditional independence of X and Y given Z holds. Then $(X^*_i, Y^*_i, Z^*_i)_{i=1}^n$ is defined as an iid sample drawn, given (X, Y, Z), from $\hat p_{ci}$. Note that there is a substantial difference between this and the previous scenarios, as in contrast to them the X and Z observations are also sampled. For both scenarios the convergence established in Theorem 6 holds.

Fig. 1: Considered models (see [11]).

However, we conjecture that the validity of p-values does not hold for these schemes. As we did not establish substantial advantages of using either bootstrap.X or CIB over the CP or CR schemes, we have not pursued discussing them here in detail.
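The CIB draw can be sketched as follows (the array representation of the estimated CI p.m.f. is an assumption made for illustration):

```python
import numpy as np

def cib_sample(p_ci_hat, n, rng):
    """Conditional Independence Bootstrap: draw an iid sample of size n from
    the estimated conditionally independent p.m.f. p_ci_hat[x, y, z]."""
    flat = p_ci_hat.ravel()
    idx = rng.choice(flat.size, size=n, p=flat)
    # Convert flat indices back to (x, y, z) triples, one row per observation.
    return np.stack(np.unravel_index(idx, p_ci_hat.shape), axis=1)

rng = np.random.default_rng(2)
p_ci_hat = np.full((2, 2, 2), 1.0 / 8.0)   # toy uniform CI estimate
sample = cib_sample(p_ci_hat, 100, rng)
```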

Numerical experiments
In the experiments, we consider the following modification of the classical asymptotic test based on the χ² reference distribution. Namely, since it is established in Theorem 6 that $2n \times CMI^*$ is approximately χ² distributed for both scenarios considered, we use a limited number of resampled samples to approximate the mean of the distribution of $2n \times CMI^*$ and use the obtained value as an estimate of the number of degrees of freedom of the χ² distribution. The adjustment relies on the equality of the mean and the number of degrees of freedom for the χ² distribution. Thus, we still use the χ² distribution as the reference distribution for CI testing; however, we adjust its number of degrees of freedom. This idea already appeared in [7]. Here, the approach is supported by Theorem 6, and the behaviour of the resulting test is compared with the other tests considered in the paper.
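A sketch of the adjustment (function names are ours; to keep the fragment self-contained, the χ² upper tail is evaluated with the Wilson-Hilferty cube-root normal approximation rather than an exact routine):

```python
import math

def chi2_sf_wh(x, k):
    """Upper tail P(chi2_k > x) for possibly non-integer df k, via the
    Wilson-Hilferty cube-root normal approximation."""
    z = ((x / k) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * k))) / math.sqrt(2.0 / (9.0 * k))
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def df_adjusted_p_value(T_obs, T_star):
    """df estimated as the mean of the B resampled statistics 2n*CMI*_b
    (the mean of a chi2_k law equals k); p-value from the adjusted chi-square."""
    df_hat = sum(T_star) / len(T_star)
    return chi2_sf_wh(T_obs, df_hat)
```

With B = 50 resampled statistics, only their mean is used, which is why a moderate B suffices here.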
We will thus investigate three tests in both resampling schemes, CR and CP. The test which will be called exact is based on Theorem 1 in the case of the CP scenario and on the analogous result for the CR scenario in [4]. The test df estimation uses the χ² distribution with the number of degrees of freedom estimated in a data-dependent way, as just described. As a benchmark we use the asymptotic test, which uses the asymptotic χ² distribution established in Theorem 6 as the reference distribution. Choice of the number of resampled samples B. As for the df estimation test the reference distribution involves only the estimator of the mean, and not estimators of upper quantiles of high order, we use a moderate number of resampled samples, B = 50, for this purpose. In order to have equal computational cost for all tests, B = 50 is also used for the exact test. Note that applying a moderate B renders the application of such tests in greedy feature selection (when such tests have to be performed many times) feasible.
The models considered are standard models used to study various types of conditional dependence of X and Y given the vector Z: e.g. in model 'Y to XZ', Y conveys information to both X and Z, whereas in model 'XY to Z' both X and Y convey information to Z. Model XOR is a standard model to investigate interactions of order 3. Below we describe the considered models in detail by giving the formula for the joint distribution of (X, Y, Z₁, Z₂, ..., Z_s). The conditional independence case (the null hypothesis) will be investigated by projecting the considered models onto the family of conditionally independent distributions.
• Model 'Y to XZ' (the first panel of Figure 1). The joint probability in the model factorises as $p(x, y, z_1, z_2, \ldots, z_s) = p(y)\,p(x, z_1, z_2, \ldots, z_s|y)$, thus it is sufficient to define the p.m.f. of Y and the conditional p.m.f. of (X, Z₁, ..., Z_s) given Y. First, Y is a Bernoulli random variable with probability of success equal to 0.5, and the conditional distribution of $(\tilde X, \tilde Z_1, \ldots, \tilde Z_s)$ given Y = y follows the multivariate normal distribution $N_{s+1}(y\gamma_s, \sigma^2 I_{s+1})$, where $\gamma_s = (1, \gamma, \ldots, \gamma^s)$, and γ ∈ [0, 1] and σ > 0 are parameters of the model. In order to obtain discrete variables from the continuous $(\tilde X, \tilde Z_1, \ldots, \tilde Z_s)$, we define the conditional distribution of (X, Z₁, ..., Z_s) given Y = y by assuming their conditional independence given Y and discretising each coordinate.
• Model 'XZ to Y' (the second panel of Figure 1). The variables X and the Z_i all have Bern(0.5) distribution, and the conditional distribution of Y given them depends on a parameter σ > 0.
• Model 'XY to Z' (the third panel of Figure 1). The joint probability factorises as $p(x, y, z_1, z_2, \ldots, z_s) = p(x)\,p(y)\prod_{i=1}^s p(z_i|x,y)$; X and Y are independent and both follow the Bernoulli distribution Bern(0.5). The distribution of Z_i depends on the arithmetic mean of X and Y, and the variables Z₁, ..., Z_s are conditionally independent given (X, Y). They follow a Bernoulli distribution whose success probability depends on (X + Y)/2 through a parameter α ≥ 0, which controls the strength of dependence; for α = 0, the variables Z_i do not depend on (X, Y).
• Model XOR. The distribution of Y is defined by $P(Y = X \oplus Z_1 \oplus Z_2) = \beta$, where 0.5 < β < 1 and ⊕ denotes addition modulo 2. We also introduce variables Z₃, Z₄, ..., Z_s independent of (X, Y, Z₁, Z₂). All the variables X, Z₁, Z₂, ..., Z_s are independent and binary with probability of success equal to 0.5.
We run simulations for fixed model parameters (model 'Y to XZ': γ = 0.5, σ = 0.5; model 'XZ to Y': σ = 0.07; model 'XY to Z': α = 3; model XOR: β = 0.8). In all the models the same number of conditioning variables s = 4 was considered. The parameters are chosen in such a way that in all four models the values of the conditional mutual information CMI(X, Y|Z) are similar and contained in the interval [0.16, 0.24] (see Figure 2 for λ = 0, which corresponds to the chosen p.m.f. p(x, y, z)). We define a family of distributions parameterised by λ ∈ [0, 1] in the following way:
$$p_\lambda(x,y,z) = (1 - \lambda)\,p(x,y,z) + \lambda\, p_{ci}(x,y,z),$$
where p denotes the joint distribution pertaining to the model with the chosen parameters and $p_{ci}(x,y,z) = p(x|z)p(y|z)p(z)$ is the Kullback-Leibler projection of p onto the family $\mathcal{P}_{ci}$ of p.m.f.s satisfying conditional independence X ⊥⊥ Y | Z (see Lemma 4 in the Appendix). The probability mass function $p_{ci}(x,y,z)$ can be explicitly calculated for the given p(x, y, z). Note that λ is a parameter which controls the strength of the shrinkage of p towards $p_{ci}$. We also underline that the Kullback-Leibler projection of $p_\lambda$ onto $\mathcal{P}_{ci}$ is also equal to $p_{ci}$ (see Lemma 5 in the Appendix). Figure 2 shows how the conditional mutual information of X and Y given (Z₁, Z₂, ..., Z_s) changes with respect to λ. For λ = 1, $p_\lambda = p_{ci}$, thus X and Y are conditionally independent and CMI(X, Y|Z) = 0.
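The projection and the shrinkage family can be sketched as follows (a minimal illustration for a p.m.f. stored as an array `p[x, y, z]`; the linear form of the interpolation is as described above, with λ = 1 recovering exact conditional independence):

```python
import numpy as np

def ci_projection(p):
    """Kullback-Leibler projection onto conditional independence:
    p_ci(x, y, z) = p(x|z) p(y|z) p(z), computed from the margins of p."""
    pz  = p.sum(axis=(0, 1))   # p(z)
    pxz = p.sum(axis=1)        # p(x, z)
    pyz = p.sum(axis=0)        # p(y, z)
    return pxz[:, None, :] * pyz[None, :, :] / pz[None, None, :]

def shrink(p, lam):
    """The interpolated family p_lambda = (1 - lambda) p + lambda p_ci."""
    return (1.0 - lam) * p + lam * ci_projection(p)

rng = np.random.default_rng(3)
p = rng.random((2, 3, 2)); p /= p.sum()    # toy joint p.m.f.
p_ci = ci_projection(p)
```

Note that the projection preserves the (x, z) and (y, z) margins of p, which is why the projection of $p_\lambda$ stays the same for every λ.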
The simulations, besides a standard analysis of attained levels of significance and power, are focused on the following issues. Firstly, we analyse the levels of significance of CMI-based tests for small sample sizes. It is known that for small sample sizes problems with the control of significance levels arise, as the probability of obtaining samples which result in empty cells (i.e. some values of (x, y, z₁, ..., z_s) are not represented in the sample) is high. This issue obviously cannot be solved by increasing the number of resampled samples, as it is due to the original sample itself. However, we would like to check whether using the χ² distribution with the estimated number of degrees of freedom as the benchmark distribution provides a solution to this problem. Moreover, the power of such tests in comparison with the exact tests is of interest. Secondly, it is of importance to verify whether the knowledge of the conditional distribution of X given Z, which is needed for the CR scheme, actually translates into better performance of the resulting test over the performance of the same test in the CP scenario. The conditional independence hypothesis is a composite hypothesis, thus an important question is how to choose representative null examples on which the control of the significance level should be checked. Here we adopt a natural and, to our knowledge, novel approach which consists in considering as the nulls the projections $p_{ci}$ of the p.m.f.s p for which power is investigated.
In Figure 3, histograms of $p_{ci}(x,y,z)$ for the considered models are shown. Although all $2^{s+2} = 64$ probabilities p(x, y, z) are larger than 0 in all the models, some probabilities may be very close to 0 (as happens in the 'XZ to Y' model). For model XOR all triples are equally likely and thus $p_{ci}(x,y,z) = 1/2^6 = 0.015625$ for all (x, y, z). If there are many values of $p_{ci}(x,y,z)$ that are close to 0, the probability of obtaining a sample without some triples (x, y, z) for which $p_{ci}(x,y,z) > 0$ is high. In particular, this happens in the 'XZ to Y' model. In the following, the performance of the procedures is studied with respect to the parameter frac = $n/2^{s+2}$ instead of the sample size n. As the number of unique values of the triples (x, y, z) equals $2^{s+2}$, frac is the average number of observations per cell in the uniform case and roughly corresponds to this index for a general binary discrete distribution. In Table 1 we provide the values of the sample sizes corresponding to changing frac, as well as the value of $np_{\min}$ for s = 4, where $p_{\min}$ is the minimal value of either the probability mass function p(x, y, z) or $p_{ci}(x,y,z)$. As $np_{\min}$ is the expected number of observations for the least likely triple, it indicates that the occurrence of empty cells is typical even for frac as large as 20. In Figure 4 the estimated fraction of rejections for the tests based on resampling in the case when the null hypothesis is true (λ = 1) is shown, the assumed level of significance being 0.05. The attained levels of significance for the asymptotic test are given separately in Figure 5. Overall, for all the procedures based on resampling, the attained level of significance is approximately equal to the assumed one. The df estimation methods, both for CP and CR, do not exceed the assumed significance level for the considered range frac ∈ [0.5, 5].
Figure 4 indicates that the distribution of $\widehat{CMI}$ is adequately represented by the χ² distribution with the estimated number of degrees of freedom. This is further analysed below (see the discussion of Figures 5 and 6). Figure 5 compares the estimated mean of $2n\,\widehat{CMI}^*$ with the asymptotic number of degrees of freedom. For all the models except 'XZ to Y', for a small number of observations per cell we underestimate the mean of $2n\,\widehat{CMI}$ by using the asymptotic number of degrees of freedom, and in these cases the significance level is exceeded. This effect is apparent even for frac equal to 5. On the other hand, in the model 'XZ to Y' the situation is the opposite, and in this case the test rarely rejects the null hypothesis. This is due to the overestimation of the mean of $2n\,\widehat{CMI}$ by the asymptotic number of degrees of freedom in the case when many empty cells occur. Note that the estimation of the mean based on resampled samples is much more accurate (in Figure 5 the obtained results are marked in blue). We also note that the condition $np_{\min} \ge 5$ is frequently cited as the condition under which the test based on the asymptotic χ² distribution can be applied. Note, however, that in the considered examples and for frac ≥ 20, the asymptotic test controls the level of significance fairly well, whereas $np_{\min}$ can be of order $10^{-11}$ (Table 1). Moreover, for frac = 20 and λ = 0.5 the power of the asymptotic test is 1.
In Figure 6 we compare the distribution of $\widehat{CMI}$ with the resampling distributions of $\widehat{CMI}^*$ and with the χ² distribution with the estimated number of degrees of freedom by means of QQ plots. For each of 500 original samples, 50 resampled samples are generated by the Conditional Permutation method and quantiles of the resampling distributions of $\widehat{CMI}^*$ are calculated, resulting in 500 quantiles, the medians of which are shown in the plot. Medians of quantiles for the χ² distribution with the estimated number of degrees of freedom are obtained in a similar manner. Quantiles of the asymptotic distribution are also shown. Besides the fact that the distribution of $\widehat{CMI}$ is better approximated by the distribution of $\widehat{CMI}^*$, which confirms the known property of the bootstrap in the case of CMI estimation (compare Section 2.6.1 in [12]), it also follows from the figure that the distribution of $\widehat{CMI}$ is even better approximated by the χ² distribution with the estimated number of degrees of freedom. Figure 7 shows the results for the power of the testing procedures for λ = 0.25, 0.5, 0.75 with respect to frac. Since the asymptotic test does not control the significance level for these models for λ = 1, the pertaining power is omitted from the figure. As λ increases, the p.m.f. $p_\lambda$ approaches the null hypothesis described by $p_{ci}$ and the power becomes smaller in the rows. As frac gets smaller, the power of the tests also decreases; this is due to the increased probability of obtaining empty cells (x, y, z₁, ..., z_s) in the sample, because of which such observations are also absent in the resampled samples for the Conditional Permutation scheme. CR is more robust in this respect, as this occurs only when not all values of (z₁, ..., z_s) are represented in the sample. This results in better performance of the tests for the CR scheme than for the CP scheme for small values of frac (see also Figure 8).

Fig. 5: Each estimated mean is based on B = 50 resampled samples and the simulation is repeated 500 times; the average of the obtained means and mean±SE are shown in blue. The number of degrees of freedom of the asymptotic χ² distribution is shown as a solid horizontal line. (Panels: 'Y to XZ', 'XZ to Y', 'XY to Z', XOR.)

It follows that the procedures based on the χ² distribution with the estimated number of degrees of freedom are more powerful than the exact tests, regardless of the resampling scenario used. Although the advantage is small, it occurs in all cases considered. The plot also indicates that the exact tests in both scenarios behave similarly and are inferior to the tests based on the asymptotic distribution with estimated dfs, which also exhibit similar behaviour. We compare the powers in the CP and CR scenarios in Figure 8, in which the ratios of the respective powers for the exact tests and the df estimation tests are depicted by orange and green lines, respectively. Values below 1 mean that CR has greater power. The differences occur only for small values of frac. Both the df estimation and the exact tests have larger power in the CR scenario than in the CP scenario for frac ∈ [0.5, 2]. The power of both methods is similar for frac ≥ 2; thus the CP scenario might be used instead of CR, as it is as efficient. Our conclusions can be summarised as follows: • The significance level is controlled by the df estimation and exact tests for both CP and CR scenarios. It happens that the asymptotic test does
not control significance level even for frac larger than 10.Interestingly, although asymptotic case is usually significantly too liberal for small frac it also happens that it is very conservative (Figure 4, model 'XZ to Y'); Supplementary information.Appendix contains all proofs of the results in the paper, which have not been presented in the main body of the article.
As we have proven the exchangeability of the original sample and the resampled samples given $Z_n$, the test statistics based on them are also exchangeable given $Z_n$. By averaging over $Z_n$, the property also holds unconditionally.
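This exchangeability is exactly what makes the permutation p-value valid. A minimal sketch of the resulting p-value computation (the function name and the `+1` correction convention are our illustrative choices, not notation from the paper):

```python
import numpy as np

def cp_pvalue(t_obs, t_resampled):
    """Permutation p-value based on exchangeability of the original and
    resampled test statistics; the +1 correction keeps it valid (not
    anti-conservative) for any finite number B of resamples."""
    t_resampled = np.asarray(t_resampled)
    return (1 + np.sum(t_resampled >= t_obs)) / (len(t_resampled) + 1)
```

For exchangeable statistics this p-value satisfies $P(\text{p-value} \le \alpha) \le \alpha$ for every $\alpha$, which is the validity property established for the CP scheme.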
In order to prove Lemma 4, we start with the following simple lemma, which is crucial for our argument.
Proof Assume that $t_i$ is a continuity point of $F_i$ for $i = 1, \ldots, d$. By Lebesgue's dominated convergence theorem, the latter term converges to 0 as $r \to \infty$. Thus, by induction, the cumulative distribution function of $(W_1^{(r)}, \ldots, W_d^{(r)})$ converges, where $\Sigma$ is a matrix of rank $d - 1$ with elements. The univariate case is proved in [13, Th. 2.1]. We could not find an appropriate reference for the general case. However, we refrain from giving a formal proof of the multivariate case, as it follows from the univariate case in an analogous way as Lemma 4 follows from Lemma 8, for which we present a full argument below.
We now prove Lemma 4.
Proof First, observe that (6) can be rewritten as $\sim \mathrm{Hyp}_J(a^{(r)}_1, b_r)$, where $\mathrm{Hyp}_J$ is defined in Lemma 8. Since $|b_r| = n_r$, by Lemma 8 we have that $W_{ij}$ follows the hypergeometric distribution with parameters $n_r$, $a_i$, $b^{(r)}_j$, and by the law of large numbers, observing that $m^{(i)}$, we apply Lemma 8 conditionally on $(W_k)_{k<i}$ to obtain, for $i = 2, \ldots, I$, $Z_i \sim N(0, \Sigma_i)$. By Lemma 7, $Z_1, \ldots, Z_I$ are independent. By direct calculation it is easy to see that $\Sigma = (\Sigma^{k,l}_{i,j})$, where $\Sigma^{k,l}_{i,j}$ denotes the covariance of the $j$-th coordinate of the $i$-th consecutive subvector of length $J$ of $Q$ with the $k$-th coordinate of the $l$-th subvector. Since no row is distinguished, in order to establish (7) it is enough to consider $i = 1$ and $k \in \{1, 2\}$.

We now prove Theorem 5. The proof follows [14] and is based on the multivariate Berry-Esseen theorem ([15]).
We define $\Sigma^{x',y',z'}_{x,y,z} = n\,(\mathrm{Cov}^*(p^*(x,y,z)))^{x',y',z'}_{x,y,z}$ and $Q$. As $p(x,y,z) > 0$ for all $(x,y,z)$, the matrix $\Sigma_{-M}$ is invertible, cf. e.g. [16]. One element of the vector $p^*$ is omitted to ensure that the covariance matrix is invertible: as $\sum_{x,y,z} p^*(x,y,z) = 1$, the full-dimension matrix $\Sigma$ is singular. We then apply the multivariate Berry-Esseen theorem ([15]) with $d = M - 1$. We notice that $p_{tci} \to p_{ci}$ and $\Sigma_{-M} \to \Sigma_{-M}$ a.s., where $\Sigma_{-M}$ denotes the matrix $\Sigma$ without the last row and the last column, and for all $j = 1, 2, \ldots, M - 1$.

We now prove the lemma which is used in the proof of Theorem 6. We compute $(M_2)^{x',y',z'}_{x,y,z}$: the first term in the first bracket is multiplied by the consecutive terms in the second bracket, then the second term in the first bracket, and so on. We now prove two lemmas which justify the choice of null distributions in the numerical experiments.
Lemma 10 The probability mass function $p_{ci}(x, y, z) = p(x|z)\,p(y|z)\,p(z)$ minimises $D_{KL}(p\|q)$ over $q \in \mathcal{P}_{ci}$, defined as $\mathcal{P}_{ci} = \{q(x, y, z) : q(x, y, z) = q(x|z)\,q(y|z)\,q(z)\}$.

is positive semi-definite. Now we define the elements of the matrix $R(z) = (r^{x',y'}_{x,y}(z))^{x',y'}_{x,y}$ as $r^{x',y'}_{x,y}(z) = r^{x'}_{x}(z)\, p(y, z)\, p(y', z)$ and we show that $R(z) \ge 0$. Namely, for any non-zero vector $a = (a(x, y))_{x,y}$ it holds that $a^\top R(z) a \ge 0$, where the last inequality follows as $R(z) \ge 0$. However, $(R)^{x',y',z'}_{x,y,z}$

Proof
$= r^{x',y',z'}_{x,y,z} = r^{x',y'}_{x,y}(z)\, I(z = z')/p(z)$, thus for any non-zero vector $a = (a(x, y, z))_{x,y,z}$ we have that
$$a^\top R a = \sum_{x,y,z}\sum_{x',y',z'} a(x,y,z)\, r^{x',y',z'}_{x,y,z}\, a(x',y',z') = \sum_{x,y,z}\sum_{x',y',z'} a(x,y,z)\, r^{x',y'}_{x,y}(z)\, I(z = z')\, a(x',y',z')/p(z) = \sum_{z} \Big( \sum_{x,y}\sum_{x',y'} a(x,y,z)\, r^{x',y'}_{x,y}(z)\, a(x',y',z) \Big) \Big/ p(z) \;\ge\; 0.$$
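The minimising property of $p_{ci}$ stated in Lemma 10 is easy to check numerically. The sketch below, for a hypothetical strictly positive $2\times 2\times 2$ joint p.m.f., compares $D_{KL}(p\|p_{ci})$ with the divergence from $p$ to randomly drawn members of $\mathcal{P}_{ci}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical strictly positive 2x2x2 joint p.m.f. p(x, y, z).
p = rng.random((2, 2, 2)) + 0.05
p /= p.sum()

p_z = p.sum(axis=(0, 1))    # p(z)
p_xz = p.sum(axis=1)        # p(x, z)
p_yz = p.sum(axis=0)        # p(y, z)

# p_ci(x, y, z) = p(x|z) p(y|z) p(z) = p(x, z) p(y, z) / p(z)
p_ci = p_xz[:, None, :] * p_yz[None, :, :] / p_z[None, None, :]

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for strictly positive p.m.f.s."""
    return float(np.sum(p * np.log(p / q)))

# D_KL(p || p_ci) should not exceed the divergence to any other member of P_ci.
best = kl(p, p_ci)
for _ in range(200):
    qx = rng.random((2, 2)) + 0.05
    qx /= qx.sum(axis=0)                  # q(x|z), columns indexed by z
    qy = rng.random((2, 2)) + 0.05
    qy /= qy.sum(axis=0)                  # q(y|z)
    qz = rng.random(2) + 0.05
    qz /= qz.sum()                        # q(z)
    q = qx[:, None, :] * qy[None, :, :] * qz[None, None, :]
    assert kl(p, q) >= best - 1e-12
```

The random candidates never beat $p_{ci}$, in agreement with the lemma.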
is used for CI testing. We choose $B$ independent permutations in $\Pi$, construct $B$ corresponding resampled samples $(X^*_{n,b}, Y_{n,b}, Z_{n,b})$ by the CP scenario for $b = 1, 2, \ldots, B$, and calculate the values of the statistic $T^*_b = T(X^*_{n,b}, Y_{n,b}, Z_{n,b})$.

$$r^{x',y',z'}_{x,y,z} = I(z = z')\, p(z) \big[\, p(x|z)\, p(y|z)\, p(x'|z)\, p(y'|z) - I(x = x')\, p(x|z)\, p(y|z)\, p(y'|z) - I(y = y')\, p(x|z)\, p(x'|z)\, p(y|z) + I(x = x', y = y')\, p(x|z)\, p(y|z) \,\big].$$
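The testing procedure just described can be sketched as follows. The plug-in $2n\,\widehat{CMI}$ estimator and the helper names are our own illustrative choices; the only structural requirement of the CP scheme is that $X$ is permuted within each stratum of $Z$:

```python
import numpy as np

rng = np.random.default_rng(2)

def cmi_stat(x, y, z):
    """Plug-in estimate of 2n * CMI(X; Y | Z) for discrete samples."""
    n = len(x)
    cmi = 0.0
    for zv in np.unique(z):
        m = z == zv
        nz = m.sum()
        for xv in np.unique(x[m]):
            for yv in np.unique(y[m]):
                nxyz = np.sum(m & (x == xv) & (y == yv))
                if nxyz == 0:
                    continue
                nxz = np.sum(m & (x == xv))
                nyz = np.sum(m & (y == yv))
                cmi += nxyz / n * np.log(nxyz * nz / (nxz * nyz))
    return 2 * n * cmi

def cp_test(x, y, z, B=50, rng=rng):
    """Conditional Permutation test: X is permuted within each stratum of Z."""
    t_obs = cmi_stat(x, y, z)
    t_star = np.empty(B)
    for b in range(B):
        x_star = x.copy()
        for zv in np.unique(z):
            idx = np.flatnonzero(z == zv)
            x_star[idx] = x[rng.permutation(idx)]
        t_star[b] = cmi_stat(x_star, y, z)
    # p-value valid under exchangeability of the B + 1 statistics
    return t_obs, (1 + np.sum(t_star >= t_obs)) / (B + 1)

# Conditionally independent toy data: the p-value should be roughly uniform.
x = rng.integers(0, 2, size=200)
y = rng.integers(0, 2, size=200)
z = rng.integers(0, 2, size=200)
t_obs, pval = cp_test(x, y, z, B=20)
```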

Lemma 4
Let $a_r = (a^{(r)}_i)_{i=1}^{I}$ and $b_r = (b^{(r)}_j)_{j=1}^{J}$ be two vectors with coordinates being natural numbers such that $n_r := |a_r| = |b_r|$.
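The multivariate hypergeometric distribution $\mathrm{Hyp}_J$ arising here is available in NumPy, which makes it easy to check the first-moment behaviour used in the proofs. A small sketch with a hypothetical stratum (the counts in `a` and the draw size are our choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stratum with 100 objects of three types (the vector a);
# drawing 40 of them without replacement gives Hyp-distributed counts.
a = np.array([30, 50, 20])
draws = np.stack([rng.multivariate_hypergeometric(a, 40)
                  for _ in range(20000)])

# Each draw sums to the sample size, and E[count_i] = 40 * a_i / 100.
expected = 40 * a / 100
empirical = draws.mean(axis=0)
```

The empirical means agree with $40\,a_i/100$ up to Monte Carlo error, as the asymptotic analysis of the within-stratum counts requires.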

Fig. 3 :
Fig. 3: Histograms of values of probabilities $p_{ci}$ for the four considered models. The vertical dotted line shows the value of probability $p_{ci}$ when all triples $(x, y, z)$ are equally probable.

For small values of frac, the attained significance levels of the asymptotic test, shown in the top row, significantly exceed the assumed level $\alpha = 0.05$. The reason for this is shown in the bottom panel of Figure 5. The red dots represent the mean of $2n\,\widehat{CMI}$ based on $n = 10^5$ samples for each value of frac, and the solid line indicates the number of degrees of freedom of the asymptotic distribution of $2n\,\widehat{CMI}$, which for $s = 4$ equals $(|\mathcal{X}|-1)(|\mathcal{Y}|-1)|\mathcal{Z}|$.

Fig. 4 :
Fig. 4: Attained significance level of the tests based on resampled samples for the considered model p ci corresponding to λ = 1, B = 50 with respect to frac.

Fig. 5 :
Fig. 5: Top panels: levels of significance for the asymptotic test. Bottom panels: comparison of the estimated and assumed numbers of degrees of freedom in the testing procedures: mean of $2n\,\widehat{CMI}$ based on $10^5$ samples generated according to $p_{ci}$, and mean of $2n\,\widehat{CMI}^*$.

Fig. 6 :
Fig. 6: Q-Q plots of the distribution of $\widehat{CMI}$ versus the asymptotic distribution (gray), the exact resampling distribution based on permutations (yellow), and the $\chi^2$ distribution with an estimated number of degrees of freedom (green), under conditional independence for $p_{ci}$. For the two last distributions, medians of 500 quantiles of resampling distributions, each based on 50 resampled samples, are shown. The straight black line corresponds to $y = x$.

Fig. 7 :
Fig. 7: Power of the tests based on resampled samples for the considered model for λ = 0.25, 0.5, 0.75 and B = 50 with respect to frac.

Fig. 8 :
Fig. 8: Comparison of resampling scenarios. Fraction of rejections for CP divided by fraction of rejections for CR, for both exact and df estimation tests, for $\lambda = 0.5$ and $B = 50$.

Lemma 7
Assume that as r → ∞, P (W

$p_{ci} = p(x|z)\,p(y|z)\,p(z)$, and we define $p_{tci}$ (tci stands for true conditional independence) in the following way:
$$p_{tci}(x, y, z) = p(x|z)\,\frac{n(y,z)}{n(z)}\,\frac{n(z)}{n} =: p(x|z)\,\hat p(y|z)\,\hat p(z),$$
thus, since $p^*$ follows the multinomial distribution in which an observation $(x, y, z)$ has probability $p_{tci}(x, y, z)$, conditionally on the original sample we have that $E^* p^*(x, y, z) = p(x|z)\,\hat p(y|z)\,\hat p(z)$.
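Under the CR scenario this resampling can be sketched as below: $p(x|z)$ is assumed known (as the CR scheme requires), while $\hat p(y|z)\,\hat p(z) = n(y,z)/n$ is taken from the observed sample. All concrete numbers are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# CR scheme: p(x|z) is assumed known; the values below are hypothetical.
p_x_given_z = np.array([[0.3, 0.6],
                        [0.7, 0.4]])   # rows: x, columns: z; columns sum to 1

# Observed (y, z) sample, from which n(y, z) is computed.
n = 1000
y = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)

n_yz = np.array([[np.sum((y == yv) & (z == zv)) for zv in (0, 1)]
                 for yv in (0, 1)])    # n(y, z)

# p_tci(x, y, z) = p(x|z) * n(y,z)/n(z) * n(z)/n = p(x|z) * n(y,z)/n
p_tci = p_x_given_z[:, None, :] * (n_yz / n)[None, :, :]

# One multinomial resample of size n with cell probabilities p_tci,
# so that E* p*(x, y, z) = p(x|z) * phat(y|z) * phat(z).
counts = rng.multinomial(n, p_tci.ravel()).reshape(p_tci.shape)
p_star = counts / n
```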

Indeed, $D_{KL}(p\|q) - D_{KL}(p\|p_{ci})$. We note that for any $z$ the matrix $R(z)$ defined as $(R(z))^{x'}_{x} = r^{x'}_{x}(z) = I(x = x')\, p(x|z) - p(x|z)\, p(x'|z)$

$$= \sum_{x,y}\sum_{x',y'} a(x,y)\, r^{x',y'}_{x,y}(z)\, a(x',y') = \sum_{x,y}\sum_{x',y'} a(x,y)\, r^{x'}_{x}(z)\, p(y,z)\, p(y',z)\, a(x',y').$$

Conditional mutual information of the random variables $X$ and $Y$ given $Z = (Z_1, Z_2, Z_3, Z_4)$, whose joint distribution equals $p_\lambda = \lambda p_{ci} + (1-\lambda)p$, where $p$ and $p_{ci}$ are characterised by the chosen models and parameters (see text).

Table 1 :
Values of np min , where p min = min (x,y,z) p ci (x, y, z) or p min = min (x,y,z) p(x, y, z) with respect to n. frac values correspond to s = 4.
• The power of the estimated df test is consistently larger than that of the exact test, both for the CR and CP scenarios. The advantage is usually more pronounced closer to the null hypothesis (larger $\lambda$);
• There is no significant difference in power between the df estimation tests in the CR and CP scenarios apart from the region frac $\in [0.5, 2]$. The same holds for both exact tests, excluding frac $\in [0.5, 1.5]$. Moreover, the df estimation test for the CP scenario has larger power than the CR exact test.
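The df estimation test referred to above adjusts the reference $\chi^2$ distribution by estimating its number of degrees of freedom from the $B$ resampled statistics, using the fact that a $\chi^2_k$ variable has mean $k$. A minimal sketch of this adjustment (function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)

def estimate_df(t_star):
    """Estimate the degrees of freedom of the reference chi-square
    distribution from resampled statistics, using E[chi2_k] = k."""
    return float(np.mean(t_star))

# Toy check on B = 50 draws from a chi-square distribution with 7 df:
# the estimate should be close to 7.
t_star = rng.chisquare(7, size=50)
df_hat = estimate_df(t_star)
```

The estimated `df_hat` then replaces the nominal $(|\mathcal{X}|-1)(|\mathcal{Y}|-1)|\mathcal{Z}|$ degrees of freedom when computing the critical value of the test.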