Asymptotic Distributions of Empirical Interaction Information

Interaction Information is one of the most promising interaction strength measures, with many desirable properties. However, its use for interaction detection has been hindered by the fact that, apart from the simple case of overall independence, the asymptotic distribution of its estimate has not been known. In the paper we provide asymptotic distributions of its empirical versions, which are needed for formal testing of interactions. We prove that for a three-dimensional nominal vector the normalized empirical interaction information converges to the normal law unless the distribution coincides with its Kirkwood approximation. In the opposite case the convergence is to the distribution of a weighted sum of centred chi square random variables. This case is of special importance as it roughly corresponds to interaction information being zero, and the asymptotic distribution can be used for the construction of formal tests for interaction detection. The result generalizes the result of Han (Inf Control 46(1):26–45, 1980) for the case when all coordinate random variables are independent. The derivation relies on studying the structure of the covariance matrix of the asymptotic distribution and its eigenvalues. For the case of a 3 × 3 × 2 contingency table, corresponding to the study of two interacting Single Nucleotide Polymorphisms (SNPs) for the prediction of a binary outcome, we provide a complete description of the asymptotic law and construct approximate critical regions for testing of interactions when the two SNPs are possibly dependent. We show in numerical experiments that the test based on the derived asymptotic distribution is easy to implement and yields actual significance levels consistently closer to the nominal ones than the test based on the chi square reference distribution.


Introduction
Detection of interactions between explanatory variables occurring when predicting a response is a research issue of fundamental importance which has attracted widespread interest. This is due to the recognition of the fact that frequently in the natural sciences finding the main effects of predictors is not sufficient to explain their overall effect. When two predictors are considered, which is the case we focus on in the paper, one frequently observes their synergetic effect, when their joint influence is larger than the sum of the individual effects. In the opposite case, one of the predictors may inhibit the effect of the other, meaning that together they affect the response less strongly than the sum of their main effects would indicate. In particular, it may happen that neither of the predictors individually influences the response (e.g. occurrence of a certain disease) but jointly they enhance or inhibit the occurrence of certain of its values. The premier example is detection of gene-gene or gene-environment interactions sought after in Genome-Wide Association Studies (GWAS), see e.g. Cordell (2002, 2009). In the case of a binary response the most popular approach to quantify the strength of interaction is a method based on a general methodology due to R. Fisher, which consists in fitting a logistic model and testing whether some of the coefficients corresponding to interaction terms are nonzero (see e.g. Agresti 2003). Many other measures have been proposed and there is no general agreement on which one adequately reflects biochemical or physiological interaction between genes. However, one of the indices which has gained popularity is the entropy-based Interaction Information (II) measure defined in Section 2. This is due to its intuitive definition and the fact that it is a nonparametric measure, thus its interpretation does not depend on any parametric model which has to fit the data at hand. It is now routinely used in GWAS, see e.g. Moore et al. (2006) or Chanda et al.
(2008) and Sucheston et al. (2010), where the AMBIANCE package based on II is described. Moreover, it has been shown formally in Mielniczuk and Teisseyre (2018), by introduction of a partial order on a set of interaction measures, that in many situations Interaction Information is more discriminative than logistic interaction; in particular there may exist nontrivial interactions detected by II in an additive logistic regression model which does not contain logistic interactions.
Besides its direct use for detection of interactions, Interaction Information is widely used in related contexts such as variable selection based on mutual information, where many criterion functions used for selection take interactions into account. We mention e.g. the CIFE criterion (cf. Lin and Tang 2006) used in greedy selection, which for a given candidate predictor is defined as a sum of its mutual information with the response and all interaction informations between the candidate and the already chosen variables. Many similar methods involving II are reviewed in Brown et al. (2012), see also Meyer et al. (2008).
One of the problems which hindered wider use of II in interaction detection is that statistical properties of its sample counterparts were not fully understood. In particular, the asymptotic distribution of the plug-in estimate of II, denoted ÎI further on, was not known for the general case when II vanishes, and thus it was not possible to construct correct rejection regions for such tests. In practice, the chi square distribution with an appropriate number of degrees of freedom is used, relying on the result due to Han (cf. Han 1980), who derived the asymptotic distribution of ÎI under the assumption of overall independence. This, however, may lead to a large number of false signals when the pertaining test is employed, as overall independence is only a very special situation in which II vanishes. In particular, when one would like to test H_0 against the inhibition effect H_1 : II < 0, a test based on the chi square reference distribution is intuitively inadequate, as such a distribution is supported on the positive half-line whereas the left tail of the distribution of ÎI extends to negative values, resulting in too large critical thresholds. It was also noticed that when the dependence between predictors is large, the sampling distribution of ÎI deviates strongly from the chi square distribution.
The present paper attempts to partially solve this problem when both predictors and the response are nominal variables by deriving the explicit form of the asymptotic distribution of ÎI under a more general scenario than in Han (1980), which allows for dependence of the predictors. The derivation is algebraic in nature and relies on studying the structure of some matrices related to the asymptotic distribution and their eigenvalues. The situation of dependent predictors occurs frequently, e.g. in Genome-Wide Association Studies, where dependence of SNPs in close proximity (linkage disequilibrium) is due to the crossing-over mechanism, and recent methods try to accommodate it (Duggal et al. 2008). It is shown in Theorem 2 that in the case when the underlying distribution is different from its Kirkwood approximation, for a sample consisting of n elements, the asymptotic distribution of n^{1/2}(ÎI − II) is normal. In the opposite case, which includes the situation when both predictors are independent of the response, the distribution of 2n ÎI is close to a certain weighted chi square distribution (cf. Section 3.2). For the case of special interest in GWAS studies, when predictors assume three values, the weights of the distribution are determined and the appropriate test is proposed. It is shown by simulations that for the proposed test of independence between predictors and the response, actual significance levels (type I errors) are much closer to the nominal ones than the significance levels for the test based on the chi square distribution. In order to illustrate this point we give below actual significance levels when the distribution of the two predictors pertains to the Clayton copula discussed in Section 4 with parameter θ reflecting dependence, and the predictors are independent of the binary response. The proposed tests, denoted W(λ̂_1, λ̂_2) and Z(Δ̂), pertain to the asymptotic distribution of ÎI derived in the paper and to its approximation, respectively. They are discussed in Section 4.
The chi square test is based on the reference chi square distribution with 4 degrees of freedom and nominal significance level α = 0.05. We see that the difference between actual and nominal levels is much smaller for the proposed tests than for the chi square test (Table 1).
The paper is organized as follows. In Section 2 we introduce some information-theoretic concepts and discuss the delta method, which is the main technical tool used to derive the distribution of ÎI. In Sections 3.1 and 3.2 we discuss convergence to normal and weighted chi square laws and reprove Han's result using the considered approach. In Section 3.3 we characterize the asymptotic law for the case when both predictors assume three values. In Section 4 we introduce the resulting test for interaction detection and check its actual levels for several trivariate distributions. Investigation of the power of the proposed test is left for future research.

Definitions and Notation
We will consider a three-dimensional nominal variable (X_1, X_2, Y) with distribution P = P^{X_1,X_2,Y} such that X_1, X_2 and Y have correspondingly n_{X_1}, n_{X_2} and n_Y possible values. Let N = n_{X_1} × n_{X_2} × n_Y. We purposefully denote the third coordinate by Y and not X_3 in order to underline the fact that in the presented applications the two first coordinates correspond to the values of explanatory variables (e.g. Single Nucleotide Polymorphisms (SNPs)) which we use to predict the value of Y. We will denote by p_{ijk} the probability P(X_1 = x_i, X_2 = x_j, Y = y_k). Accordingly, with a slight abuse of notation, p_{ij}, p_{ik}, p_{jk} will denote the corresponding bivariate probabilities, i.e. p_{ij} = P(X_1 = x_i, X_2 = x_j), and p_i, p_j, p_k the marginal probabilities, i.e. p_j = P(X_2 = x_j). We stress that the distributions of X_1 and X_2 may differ. We will use the shorthand notation p for (p_{ijk}), (p_{ij}) or (p_i) depending on the context, and throughout assume that all p_{ijk} are positive. We assume that we observe n independent samples from P and denote by n_{ijk} the number of samples with a particular value (x_i, x_j, y_k). Then the vector (n_{ijk}) has multinomial distribution Mult(n, p), where p is the N-dimensional vector of probabilities. Let p̂_{ijk} = n_{ijk}/n.
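The sampling scheme above can be sketched in a few lines of numpy; a minimal illustration (the joint table p below is an arbitrary choice, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint distribution p_ijk on a 3 x 3 x 2 table
# (axis 0: X1, axis 1: X2, axis 2: Y).
p = rng.dirichlet(np.ones(18)).reshape(3, 3, 2)

n = 10_000
# (n_ijk) ~ Mult(n, p): one draw of the whole contingency table.
counts = rng.multinomial(n, p.ravel()).reshape(3, 3, 2)
p_hat = counts / n                       # plug-in frequencies p_hat_ijk

# Marginals are obtained by summing out coordinates, e.g. p_ij = P(X1, X2).
p_ij = p_hat.sum(axis=2)
p_i = p_hat.sum(axis=(1, 2))
print(p_hat.sum(), p_ij.sum(), p_i.sum())   # each equals 1 up to rounding
```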

Interaction Information
First we recall some concepts developed in Information Theory. The entropy of P is defined as

H(X_1, X_2, Y) = − Σ_{i,j,k} p_{ijk} ln p_{ijk},

with H(X_1, X_2) and H(X_1) defined accordingly. Also, we consider Mutual Information, which is a measure quantifying the amount of information obtained about one random variable due to the knowledge of the other random variable. It is defined as

I(X_1; X_2) = Σ_{i,j} p_{ij} ln ( p_{ij} / (p_i p_j) )   (1)

and thus can be regarded as a measure of association for a pair of discrete variables. It determines how different the joint distribution is from the product of the marginal distributions. More specifically, it is equal to the Kullback-Leibler (KL) divergence KL(P^{X_1,X_2} || P^{X_1} × P^{X_2}) between these two distributions and can also be interpreted as the averaged KL divergence between the conditional distribution P^{X_1|X_2} and P^{X_1}. The conditional Mutual Information is

I(X_1; X_2 | Y) = Σ_{i,j,k} p_{ijk} ln ( p_{ijk} p_k / (p_{ik} p_{jk}) ).   (2)

Note that the conditional Mutual Information is the Mutual Information of X_1 and X_2 given Y averaged over the values of Y. In the same vein as above, I(X_1; X_2 | Y) can be interpreted as the expected decrease of the amount of uncertainty about (X_1, X_2) when Y is known. The conditional Mutual Information equals zero if and only if the predictors are conditionally independent given the outcome Y. Analogously to Eq. 1, we define the Mutual Information between the pair (X_1, X_2) and Y as

I[(X_1, X_2); Y] = Σ_{i,j,k} p_{ijk} ln ( p_{ijk} / (p_{ij} p_k) ).

The main object of interest here, namely Interaction Information (II) (McGill 1954; Fano 1961), is defined as

II(X_1; X_2; Y) = I[(X_1, X_2); Y] − I(X_1; Y) − I(X_2; Y).   (3)

It follows from the above definition that II can be interpreted as the part of the Mutual Information of (X_1, X_2) and Y which is due solely to the interaction between X_1 and X_2 in predicting Y, i.e. the part of I[(X_1, X_2); Y] which remains after subtraction of the individual informations between Y and X_1 and between Y and X_2. In other words, II is obtained by removing the main effects from the term describing the overall dependence between Y and the pair (X_1, X_2). Below we discuss some properties of II which will be used to prove our main results in the next section.
The standard representation of II(X_1; X_2; Y) is

II(X_1; X_2; Y) = Σ_{i,j,k} p_{ijk} ln ( (p_{ijk} p_i p_j p_k) / (p_{ij} p_{ik} p_{jk}) ),   (4)

which also shows that II(X_1; X_2; Y) is symmetric in its arguments. It is known that II is closely related to the so-called Kirkwood superposition approximation (Matsuda 2000; Mielniczuk and Teisseyre 2018), defined as the distribution P^K corresponding to the mass function

p^K_{ijk} = (p_{ij} p_{ik} p_{jk}) / (p_i p_j p_k),   (5)

where the upper index K stands for Kirkwood. Namely, it follows from Eq. 4 that the Interaction Information can be written using the Kullback-Leibler divergence between the joint distribution of X_1, X_2, Y and its Kirkwood superposition approximation:

II(X_1; X_2; Y) = Σ_{i,j,k} p_{ijk} ln ( p_{ijk} / p^K_{ijk} ).   (6)

Note that the Kirkwood approximation is non-normalized, i.e. it is not necessarily a probability distribution. We define η = Σ_{i,j,k} p^K_{ijk} to be the normalizing constant for P^K. The second important property of the Interaction Information is

II(X_1; X_2; Y) = I(X_1; X_2 | Y) − I(X_1; X_2),   (7)

which indicates that II measures the influence of the variable Y on the amount of information shared between X_1 and X_2. In other words, it quantifies how much Y influences the dependence between X_1 and X_2.
Observe that it follows from Eq. 7 that II, in contrast to Mutual Information, can be either positive or negative. A positive value of II indicates that the interaction between X_1 and X_2 enhances prediction of Y, whereas negative values indicate that the interaction diminishes or inhibits such prediction; in other words, in the first case the conditional dependence is stronger than the unconditional one, and a negative value of II indicates that Y weakens or inhibits the dependence between X_1 and X_2. For a more detailed discussion and examples of positive and negative II we refer to Mielniczuk and Teisseyre (2018).
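The identities above (the standard representation of II and II = I(X_1; X_2 | Y) − I(X_1; X_2)) can be checked numerically on any strictly positive joint table; a sketch in numpy, with an arbitrary illustrative distribution:

```python
import numpy as np

def mutual_info(pxy):
    """I(U; V) for a joint table pxy with strictly positive entries."""
    pu = pxy.sum(axis=1, keepdims=True)
    pv = pxy.sum(axis=0, keepdims=True)
    return float(np.sum(pxy * np.log(pxy / (pu * pv))))

def interaction_info(p):
    """II(X1; X2; Y) via the standard (Kirkwood-type) representation, p[i,j,k] > 0."""
    p_ij = p.sum(axis=2)[:, :, None]
    p_ik = p.sum(axis=1)[:, None, :]
    p_jk = p.sum(axis=0)[None, :, :]
    p_i = p.sum(axis=(1, 2))[:, None, None]
    p_j = p.sum(axis=(0, 2))[None, :, None]
    p_k = p.sum(axis=(0, 1))[None, None, :]
    return float(np.sum(p * np.log(p * p_i * p_j * p_k / (p_ij * p_ik * p_jk))))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(18)).reshape(3, 3, 2)   # arbitrary 3 x 3 x 2 joint table

# II = I(X1; X2 | Y) - I(X1; X2): conditional MI is MI given Y averaged over Y.
p_k = p.sum(axis=(0, 1))
cmi = sum(p_k[k] * mutual_info(p[:, :, k] / p_k[k]) for k in range(2))
ii = interaction_info(p)
print(ii, cmi - mutual_info(p.sum(axis=2)))       # the two values coincide
```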
Empirical versions of the introduced quantities are defined as their plug-in variants. In particular, we have

Ĥ(X_1, X_2, Y) = − Σ_{i,j,k} p̂_{ijk} ln p̂_{ijk},

where p̂_{ijk} = n_{ijk}/n and p̂_{ij}, p̂_i, p̂_j are defined analogously. Moreover,

ÎI = Î[(X_1, X_2); Y] − Î(X_1; Y) − Î(X_2; Y).   (8)

Note that n(p̂_1, ..., p̂_{n_{X_1}}) has multinomial distribution Mult(n, (p_1, ..., p_{n_{X_1}})) and an analogous property holds for n(p̂_{11}, ..., p̂_{n_{X_1} n_{X_2}}). This allows us to use a particular version of the delta method stated in Theorem 1. The asymptotic distribution of the empirical version of II was derived by Han in his paper (Han 1980) for the special case when II vanishes because all variables X_1, X_2 and Y are independent. The limiting distribution of 2n ÎI is then chi-square with (n_{X_1} − 1)(n_{X_2} − 1)(n_Y − 1) degrees of freedom. However, this case is too restrictive for testing purposes when we would like to allow for dependence between the predictors X_1 and X_2. No significant progress has been made in this direction since Han's seminal paper. The main technical difficulty in deriving the distribution of ÎI is that its distribution does not follow directly from knowledge of the asymptotic distributions of the terms on the rhs of Eq. 8, due to the dependence between them. In the paper, we prove that if the distribution of (X_1, X_2, Y) is different from its Kirkwood approximation, the limiting distribution of ÎI is normal. In the case when these distributions coincide, the asymptotic distribution, obtained for a different normalization of ÎI, coincides with the distribution of a sum of weighted chi square variables. Finally, we note that modifications of II are also used for interaction detection. For example, the measure defined as the Kullback-Leibler divergence from the normalized Kirkwood superposition approximation P̃^K = P^K/η, which equals II + ln η, can be used as a measure of interaction strength (Wan et al. 2010; Mielniczuk and Rdzanowski 2017).
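A hedged sketch of the plug-in estimator ÎI from a count table, together with Han's reference degrees of freedom; the zero-cell guard and the particular fully independent marginals below are our own illustrative choices:

```python
import numpy as np

def ii_plugin(counts):
    """Plug-in estimator of interaction information from a 3-way count table."""
    p = counts / counts.sum()
    p = np.where(p == 0, 1e-12, p)      # guard: the theory assumes all cells positive
    p_ij = p.sum(axis=2)[:, :, None]; p_ik = p.sum(axis=1)[:, None, :]
    p_jk = p.sum(axis=0)[None, :, :]
    p_i = p.sum(axis=(1, 2))[:, None, None]; p_j = p.sum(axis=(0, 2))[None, :, None]
    p_k = p.sum(axis=(0, 1))[None, None, :]
    return float(np.sum(p * np.log(p * p_i * p_j * p_k / (p_ij * p_ik * p_jk))))

# Fully independent 3 x 3 x 2 distribution, so II = 0 and 2n*II_hat is
# asymptotically chi-square with (3-1)(3-1)(2-1) = 4 degrees of freedom.
px = np.array([0.25, 0.5, 0.25])
p = px[:, None, None] * px[None, :, None] * np.array([0.95, 0.05])

rng = np.random.default_rng(2)
n = 5000
counts = rng.multinomial(n, p.ravel()).reshape(3, 3, 2)
stat = 2 * n * ii_plugin(counts)
df = (3 - 1) * (3 - 1) * (2 - 1)
print(df, stat)                         # stat fluctuates around df
```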

Delta Method
Recall that N = n_{X_1} × n_{X_2} × n_Y. We present here a useful method for determining the asymptotic distribution of a certain function f(p̂) of the cell frequencies, where p̂ = (p̂_{ijk}) = (p̂_1, ..., p̂_N) is the vector of frequencies pertaining to the multinomial distribution Mult(n, p).
It is based on the following Taylor expansion

f(p̂) = f(p) + Df(p)(p̂ − p) + (1/2)(p̂ − p)^T D²f(p)(p̂ − p) + r_n,

where Df(p) and D²f(p) denote the first and the second derivative of f at p and r_n is the remainder term. Under appropriate conditions it is shown that the first term of the expansion determines the asymptotic law, which in this case is zero mean normal with a certain variance σ². When σ² equals 0, the second term determines the law after increasing the normalisation from n^{1/2} to n. The formal result is as follows (see e.g. Agresti 2003 for part (i); part (ii) is an obvious extension of (i)). Let →_d denote convergence in distribution and I(A) the indicator function of an event A.

Theorem 1 Let Σ = diag(p) − pp^T and H = D²f(p). Then, when n → ∞:

(i) n^{1/2}(f(p̂) − f(p)) →_d N(0, σ²), where

σ² = Σ_i p_i (∂f(p)/∂p_i)² − ( Σ_i p_i ∂f(p)/∂p_i )²;   (12)

(ii) if σ² = 0, then

2n(f(p̂) − f(p)) →_d Z^T Σ^{1/2} H Σ^{1/2} Z,   (13)

where Z is a standard normal random vector in R^N.

Note that the variance in Eq. 12 coincides with the variance of a random variable which takes the value ∂f(p)/∂p_i with probability p_i. Observe that the distribution of the limit in Eq. 13 can be determined by writing

Z^T Σ^{1/2} H Σ^{1/2} Z = Σ_i λ_i (x_i^T Z)²   (14)

in view of the spectral decomposition (see e.g. Schott 1997, p. 95), where x_i and λ_i are the eigenvectors and the corresponding eigenvalues of Σ^{1/2} H Σ^{1/2} or, equivalently, the eigenvalues of ΣH. As the x_i are orthonormal, the random variables x_i^T Z are independent and N(0, 1) distributed; thus the distribution in Eq. 14 equals the distribution of a sum of weighted centred chi square variables, which we will call a weighted centred chi square distribution.
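The claim that Σ^{1/2} H Σ^{1/2} and ΣH share the same eigenvalues is easy to verify numerically; a sketch with Σ = diag(p) − pp^T, the covariance of Mult(1, p), and an arbitrary symmetric H:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 6
p = rng.dirichlet(np.ones(N))
Sigma = np.diag(p) - np.outer(p, p)      # covariance matrix of Mult(1, p)

A = rng.standard_normal((N, N))
H = (A + A.T) / 2                        # arbitrary symmetric "second derivative"

# Symmetric square root of the PSD matrix Sigma via its eigendecomposition.
w, V = np.linalg.eigh(Sigma)
S_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

ev_sym = np.sort(np.linalg.eigvalsh(S_half @ H @ S_half))
ev_prod = np.sort(np.linalg.eigvals(Sigma @ H).real)
print(np.allclose(ev_sym, ev_prod))      # the two spectra coincide
```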

Convergence to Normal Law
We begin by studying the n^{1/2}-convergence of ÎI and determine when, for this normalisation, the corresponding limiting normal law is non-degenerate. The equivalent condition is given in terms of the probability vector p = (p_{ijk}). We recall that the mass function of P^K is defined in Eq. 5.

Theorem 2 We have

(i) n^{1/2}(ÎI − II) →_d N(0, σ²_II), where

σ²_II = Σ_{i,j,k} p_{ijk} ln²( p_{ijk} / p^K_{ijk} ) − II²;

(ii) σ²_II equals 0 if and only if P = P^K.

Proof Note that in view of Eq. 8 the function f(p) pertaining to II is

f(p) = Σ_{i,j,k} p_{ijk} ln( p_{ijk} / (p_{ij} p_k) ) − Σ_{i,j,k} p_{ijk} ln( p_{ik} / (p_i p_k) ) − Σ_{i,j,k} p_{ijk} ln( p_{jk} / (p_j p_k) ).   (16)

Denote the first term in the above decomposition by f_1(p) = Σ_{i,j,k} p_{ijk} ln( p_{ijk} / (p_{ij} p_k) ) and note that

∂f_1(p)/∂p_{ijk} = ln( p_{ijk} / (p_{ij} p_k) ) − 1.

This follows from differentiating all summands involving p_{ijk}, that is, those being functions of p_{ijk}, p_{ij} or p_k. Similarly, the derivative of the second term in Eq. 16 equals ln( p_{ik} / (p_i p_k) ) − 1 and the derivative of the third is analogous with i replaced by j. Then

∂f(p)/∂p_{ijk} = ln( p_{ijk} / p^K_{ijk} ) + 1,

and thus the variance in Eq. 12 equals

Σ_{i,j,k} p_{ijk} ln²( p_{ijk} / p^K_{ijk} ) − ( Σ_{i,j,k} p_{ijk} ln( p_{ijk} / p^K_{ijk} ) )²,

which coincides with σ²_II, as the second sum equals II. In order to prove (ii), note that σ²_II = 0 is equivalent to p_{ijk}/p^K_{ijk} ≡ C for some constant C > 0. We show that C = 1. Indeed, summing the equality p_{ijk} = C p^K_{ijk} over k we have

p_i p_j = C Σ_k p_{ik} p_{jk} / p_k;

summing over j we thus have p_i = C p_i, whence C = 1 and P = P^K.

Remark 1 Note that P = P^K is a stronger condition than II = 0, and both are equivalent when the normalizing constant η of the Kirkwood approximation does not exceed 1 (cf. Mielniczuk and Teisseyre 2018). It is also shown there that P = P^K implies that P is a so-called perfect distribution (cf. Darroch 1974). In particular, we have that when (X_1, X_2) is independent of Y then P = P^K and σ²_II = 0. Note that the case when II = 0 and σ²_II > 0 is possible. This makes the behaviour of ÎI when II = 0 quite intricate, as both n^{−1/2} and n^{−1} convergence rates are possible.
The situation when σ²_II = 0 is studied in detail in the next section.

Convergence to Weighted Chi Square Distribution
In view of Theorem 1 and Eq. 14, when P = P^K we have that

2n ÎI →_d Σ_{i=1}^N λ_i Z_i²,

where the Z_i are independent N(0, 1) random variables and the λ_i are the eigenvalues of ΣH, with Σ and H defined in Section 2.3. Thus asymptotically 2n ÎI follows a weighted centred chi-square distribution. We now study the structure of H in more detail. For any matrix A with row and column indices in {1, ..., n_{X_1}} × {1, ..., n_{X_2}} × {1, ..., n_Y} we will denote by A^{i'j'k'}_{ijk} its element with row index equal to ijk and column index equal to i'j'k'. In order to keep the notation consistent, B_{ij} will be denoted analogously; |s| denotes the number of elements of s, and the inclusion s ⊆ z is meant for each of the indices i, j and k separately. This can be directly checked by differentiating (17).
Lemma 2 The matrix M := ΣH has the following form. Using the previous lemma we obtain the entries of M explicitly. From now on we will assume that (X_1, X_2) and Y are independent. The next lemma states the representation of M as a Kronecker product of two matrices (for the definition of the Kronecker product and its properties we refer to Schott 1997).
Lemma 4 M = D ⊗ C, where ⊗ is the Kronecker product. Proof Independence of (X_1, X_2) and Y implies that each entry of M factorizes accordingly into terms of the form (−1)^{|s|} p_s. This ends the proof.

Lemma 5
The following properties of C hold: (i) C is idempotent; (ii) C has n_Y − 1 eigenvalues equal to 1 and one eigenvalue equal to 0.

Proof Denote, locally in the proof, P(Y = y_i) by p(y_i), i = 0, ..., n_Y − 1, and let p_{n_Y} = (p(y_0), ..., p(y_{n_Y−1}))^T. We observe that p_{n_Y}^T 1_{n_Y} = 1, and thus I_{n_Y} − C is idempotent and consequently C is idempotent. To prove (ii), note that tr(C) = n_Y − 1 and 1_{n_Y} is an eigenvector with eigenvalue λ_{n_Y}(C) = 0. As C is idempotent, this means that the remaining n_Y − 1 eigenvalues equal 1. We now show that the proved properties imply Han's theorem (cf. Han 1980).
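The surviving parts of the proof (p_{n_Y}^T 1_{n_Y} = 1, idempotency of I_{n_Y} − C, tr(C) = n_Y − 1, C 1_{n_Y} = 0) are all consistent with the explicit form C = I_{n_Y} − 1_{n_Y} p_{n_Y}^T; treating that form as an assumption on our part, the stated properties are immediate to check numerically:

```python
import numpy as np

# Assumed explicit form C = I - 1 p^T (consistent with the proof of Lemma 5);
# p is an arbitrary illustrative distribution of Y with n_Y = 4 values.
p = np.array([0.1, 0.2, 0.3, 0.4])
n_Y = p.size
C = np.eye(n_Y) - np.outer(np.ones(n_Y), p)

assert np.allclose(C @ C, C)                     # C is idempotent
assert np.isclose(np.trace(C), n_Y - 1)          # tr(C) = n_Y - 1
assert np.allclose(C @ np.ones(n_Y), 0)          # 1_{n_Y} is an eigenvector for 0
ev = np.sort(np.linalg.eigvals(C).real)
print(ev)                                        # one 0 and n_Y - 1 ones
```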

Theorem 3
If X_1, X_2, Y are all independent, then

2n ÎI →_d χ²_{(n_{X_1}−1)(n_{X_2}−1)(n_Y−1)}.

Proof It is enough to show that M is idempotent, since then from Lemma 3 it follows that it has (n_{X_1} − 1)(n_{X_2} − 1)(n_Y − 1) eigenvalues equal to 1 and the remaining ones equal to 0. Thus the asymptotic distribution of 2n ÎI is χ²_{(n_{X_1}−1)(n_{X_2}−1)(n_Y−1)}, as P^K = P in the case of independent variables and thus σ²_II = 0. Analogously as in the proof of Lemma 4, we observe that M = A ⊗ B ⊗ C, where A and B are the analogues of C pertaining to the marginal distributions of X_1 and X_2, respectively, and C is defined in Lemma 5. From the proof of Lemma 5 we know that A, B, C are idempotent. Hence from the mixed-product property of the Kronecker product it follows that

M² = (A ⊗ B ⊗ C)(A ⊗ B ⊗ C) = A² ⊗ B² ⊗ C² = A ⊗ B ⊗ C = M.

Thus M is idempotent.
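The mixed-product argument can be illustrated numerically. Here A, B and C are taken in the I − 1p^T form suggested by the proof of Lemma 5 (an assumption on our part), with arbitrary marginals for X_1, X_2 and a binary Y:

```python
import numpy as np

def centering(p):
    """I - 1 p^T: idempotent with trace len(p) - 1 (form suggested by Lemma 5)."""
    return np.eye(p.size) - np.outer(np.ones(p.size), p)

pA = np.array([0.25, 0.5, 0.25])     # marginal of X1
pB = np.array([0.2, 0.3, 0.5])       # marginal of X2
pC = np.array([0.95, 0.05])          # marginal of binary Y

M = np.kron(np.kron(centering(pA), centering(pB)), centering(pC))

# Mixed-product property: a Kronecker product of idempotent factors is
# idempotent, so its eigenvalues are 0/1 and the number of ones equals tr(M).
assert np.allclose(M @ M, M)
print(round(np.trace(M)))            # (3-1)*(3-1)*(2-1) = 4
```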
Proof The eigenvector for the eigenvalue 0 is 1_{n_{X_1} n_{X_2}}, as for each row of D indexed by i, j the row entries sum to 0. Moreover, for any i_0, j_0 such that i_0 ≠ n_{X_1}, j_0 ≠ n_{X_2} we define the vector α = (α_{ij}) and show that α is an eigenvector of D corresponding to the eigenvalue 1. Indeed, Dα = α, and thus α is an eigenvector of D. Since there are (n_{X_1} − 1)(n_{X_2} − 1) such vectors for different i_0, j_0 and they are linearly independent, the lemma is proved.

Special Cases
We state now the result which is the main conclusion of the paper.
In the special case when Y is binary, i.e. n_Y = 2, we have

W = T_1 + λ̄_1 (Z_1² − Z_2²) + λ̄_2 (Z_3² − Z_4²),

where T_1 ∼ χ²_4 and Z_1, ..., Z_4 are independent N(0, 1) random variables independent of T_1.
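The limit W (four unit eigenvalues plus two symmetric pairs ±λ̄_i) is straightforward to simulate; the values of λ̄_1, λ̄_2 below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
lam1, lam2 = 0.6, 0.3                 # illustrative values of the eigenvalue pairs
m = 200_000

T1 = rng.chisquare(4, size=m)         # chi^2_4 part from the four unit eigenvalues
Z = rng.standard_normal((4, m))
W = T1 + lam1 * (Z[0]**2 - Z[1]**2) + lam2 * (Z[2]**2 - Z[3]**2)

# The symmetric pairs contribute zero mean, so E[W] = 4 regardless of the lambdas,
# while the tails (and in particular the negative left tail) depend on them.
print(W.mean())
```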
We discuss now the case when n_{X_1} = n_{X_2} = 3 and n_Y equals 2, which corresponds to the important case of two SNPs interacting with a binary outcome. We will show that in this case, or more generally when Y admits n_Y values, the distribution of W can be explicitly described.
The technical proof of the lemma is moved to the Appendix.
Remark 2 (i) Note that all eigenvalues λ_i(D) are real, as they are eigenvalues of the symmetric matrix Σ^{1/2} H Σ^{1/2}; thus the relevant quantity is non-negative. We also remark that it may be 0 for dependent X_1, X_2, as happens for X_1 = X_2 having any nondegenerate distribution on {1, 2, 3}, since then H_1 = H_2 = 2. Furthermore, for arbitrary n_{X_1} and n_{X_2} it follows from the inequality that H_1 ≤ min(n_{X_1}, n_{X_2}) − 1 and the upper bound is attained. The same inequality holds for H_2. Besides, from the Jensen inequality we have a corresponding lower bound. (ii) Moreover, observe that H_2 defined in the last Lemma admits an alternative representation. Note also that H_1 equals the chi square index of the distribution of (X_1, X_2).
By using the same method as for Lemma 7, we obtain

Corollary 2
The following statements hold:

Remark 3 We stress that in Theorem 3 the response Y can take any of an arbitrary number of discrete values; thus the result is applicable e.g. to multiclass classification problems. For the case when X_1 or X_2 admits more than 3 values, we note that the matrix D defined in Lemma 4 has at least (n_{X_1} − 1)(n_{X_2} − 1) eigenvalues equal to 1 and one equal to 0. Thus, in order to determine the distribution of W, we need to compute the n_{X_1} + n_{X_2} − 2 remaining eigenvalues of D. In view of the proof of the lemma this would involve determining n_{X_1} + n_{X_2} − 2 powers of D, which is computationally challenging for larger values of n_{X_i}. However, as the polynomial Q (cf. Eq. 21) has in the general case every second coefficient equal to 0, we conjecture that explicit formulae for the λ_i should be calculable for n_{X_1} + n_{X_2} ≤ 10. An alternative method of determining the distribution of W using permutations is described in Remark 4.

Numerical Experiments
In the following we apply Theorem 4 to construct a test for the hypothesis that (X_1, X_2) is independent of Y:

H_0 : (X_1, X_2) ⊥ Y against the two-sided alternative H_1 : II ≠ 0.

Note that it follows from the discussion in Section 2.2 that H_0 implies II = 0, and thus the null and the alternative hypotheses describe disjoint events. Such a set-up is useful in cases when a researcher focuses on detection of interaction between explanatory variables in predicting the response and not on their main effects. We remark that it would also be desirable to develop tests for a broader null hypothesis H : X_1 ⊥ Y & X_2 ⊥ Y & II = 0 (no interaction and no main effects), but the asymptotic distributions of ÎI under such null hypotheses are intractable. Note that H_0 allows for dependence of the explanatory variables, which is a common case e.g. in Genome-Wide Association Studies (GWAS).
To fix ideas, we consider here the case of a binary response Y, i.e. the case when n_Y = 2. We consider the test with critical region

C = {2n ÎI < W_{α/2}} ∪ {2n ÎI > W_{1−α/2}},   (19)

where W_q is the quantile of order q of the distribution W defined in Theorem 4; that is, we reject H_0 when the observed value of 2n ÎI belongs to C. It follows from Theorem 4 that such a test has asymptotic significance level α. We investigate how the quantiles needed in the test construction can be approximated and, in consequence, how well actual significance levels pertaining to asymptotic quantiles correspond to nominal ones. We also consider an analogous test based on the χ²_4 approximation frequently used in GWAS (cf. e.g. Chanda et al. 2008 and Wan et al. 2010). We note that the asymptotic distribution of 2n ÎI is χ²_4 only under the assumption that all three variables X_1, X_2 and Y are independent. This is very restrictive, and the null hypothesis above is much more general. We will show in the following that using the χ²_4 distribution instead of the distribution of W leads to a significant increase of false rejections under H_0, especially in the case when the two-sided alternative is considered.
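Putting the pieces together, a sketch of the resulting two-sided test. The quantiles of W are approximated here by plain Monte Carlo rather than by the distr/Pacal routines used below, and the eigenvalue estimates are passed in as arguments, since Eq. 18 defining λ̄_1, λ̄_2 is not reproduced here (the values 0.3 and 0.1 in the usage are placeholders, not estimates from Eq. 18):

```python
import numpy as np

def ii_plugin(counts):
    """Plug-in estimator of interaction information from a 3-way count table."""
    p = counts / counts.sum()
    p = np.where(p == 0, 1e-12, p)
    p_ij = p.sum(axis=2)[:, :, None]; p_ik = p.sum(axis=1)[:, None, :]
    p_jk = p.sum(axis=0)[None, :, :]
    p_i = p.sum(axis=(1, 2))[:, None, None]; p_j = p.sum(axis=(0, 2))[None, :, None]
    p_k = p.sum(axis=(0, 1))[None, None, :]
    return float(np.sum(p * np.log(p * p_i * p_j * p_k / (p_ij * p_ik * p_jk))))

def w_quantiles(lam1, lam2, alpha=0.05, m=200_000, seed=0):
    """Monte Carlo quantiles of W = chi^2_4 + lam1(Z1^2-Z2^2) + lam2(Z3^2-Z4^2)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((4, m))
    W = rng.chisquare(4, m) + lam1 * (Z[0]**2 - Z[1]**2) + lam2 * (Z[2]**2 - Z[3]**2)
    return np.quantile(W, [alpha / 2, 1 - alpha / 2])

def ii_test(counts, lam1_hat, lam2_hat, alpha=0.05):
    """Reject H0 when 2n*II_hat falls outside [W_{alpha/2}, W_{1-alpha/2}]."""
    n = counts.sum()
    stat = 2 * n * ii_plugin(counts)
    lo, hi = w_quantiles(lam1_hat, lam2_hat, alpha)
    return stat, bool(stat < lo or stat > hi)

# Illustration under H0: (X1, X2) independent of Y.
rng = np.random.default_rng(5)
px = np.array([0.25, 0.5, 0.25])
p = px[:, None, None] * px[None, :, None] * np.array([0.95, 0.05])
counts = rng.multinomial(4000, p.ravel()).reshape(3, 3, 2)
stat, reject = ii_test(counts, 0.3, 0.1)   # 0.3, 0.1 stand in for Eq. 18 estimates
print(stat, reject)
```

With λ̄_1 = λ̄_2 = 0 the Monte Carlo quantiles reduce to those of χ²_4, which gives a quick consistency check of the simulation.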

Approximation of Distribution of W
In the following we denote by W the limiting distribution in Theorem 4. Note that for the 3 × 3 × 2 case W is parametrized by λ_5 and λ_7, i.e. W = W(λ_5, λ_7), where λ_5, λ_7 are given by Eq. 18. In order to simplify the notation we let λ̄_1 := λ_5 and λ̄_2 := λ_7. For testing purposes W(λ̄_1, λ̄_2) is approximated by W(λ̂_1, λ̂_2), where the λ̂_i are the plug-in estimators of the λ̄_i based on Eq. 18. The values of quantiles of W(λ̂_1, λ̂_2) can be numerically calculated using the R package distr or the Python package Pacal. The differences between the results are minor and the results presented below are obtained with the aid of distr.
We checked that for sample sizes larger than 100, when a fixed type of dependence is considered (and thus λ̄_1 and λ̄_2 are fixed), the quantiles of order 0.95 and 0.975, as well as 0.025 and 0.05, of W(λ̄_1, λ̄_2) are very well approximated by the quantiles of W(λ̂_1, λ̂_2). In Table 2 below we show how the expected values of quantiles of W(λ̂_1, λ̂_2) compare with the quantiles of W(λ̄_1, λ̄_2) when the distribution of (X_1, X_2) is given by the normal copula discussed below, for n = 1000, P(Y = 1) = 0.9 and the number of repetitions L = 1000. The parameter θ corresponds to the correlation coefficient. The analogous results for the Clayton copula with parameter θ discussed in Section 4.2 are given in Table 3. Thus, despite the more complicated form of the distribution of W(λ̄_1, λ̄_2) compared with the chi square distribution, approximation of its quantiles does not pose any computational difficulties. Moreover, it is known that distributions of weighted chi square random variables can be adequately approximated by αχ²_d + β, where χ²_d is a generalized chi square distribution with d ∈ R_+, and the parameters α, d and β are chosen to match the three first moments of W(λ̄_1, λ̄_2) (see Zhang 2005). It follows from formulas (5) in Zhang's paper that in the case considered here, when the first four non-zero eigenvalues are 1 and the remaining four form two symmetric pairs ±λ̄_i, the approximating distribution depends only on Δ = λ̄_1² + λ̄_2², and thus it is called Z(Δ). Analogously, the plug-in version of Z(Δ) is denoted by Z(Δ̂). It will follow that for the purposes of approximating the critical region C the quantiles of Z(Δ̂) can be used instead of those of W(λ̂_1, λ̂_2) (see Tables 2 and 3, which also contain the values of quantiles of the corresponding Z(Δ) and averaged values of quantiles of Z(Δ̂)). The only exception is when the left tail is considered and the dependence is strong (see e.g. the results in Table 2 for the quantiles of order 0.025 and 0.05 and θ = 0.9).
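The reduction to a single parameter can be traced through the first three cumulants: for W they are κ_1 = Σλ_i = 4, κ_2 = 2Σλ_i² = 8 + 4Δ with Δ := λ̄_1² + λ̄_2², and κ_3 = 8Σλ_i³ = 32 (the cubes of the symmetric pairs cancel), while for αχ²_d + β they are αd + β, 2α²d and 8α³d. Matching them gives a sketch of the moment-matched approximation; this is our own derivation along the lines of Zhang's formulas and should be treated as an assumption:

```python
def zhang_match(delta):
    """Match alpha*chi2_d + beta to W via first three cumulants; delta = lam1^2 + lam2^2."""
    k1 = 4.0                      # sum of eigenvalues of W
    k2 = 8.0 + 4.0 * delta        # 2 * sum of squared eigenvalues
    k3 = 32.0                     # 8 * sum of cubed eigenvalues (pairs cancel)
    alpha = k3 / (4.0 * k2)       # from k3/k2 = 4*alpha
    d = k2 / (2.0 * alpha**2)
    beta = k1 - alpha * d
    return alpha, d, beta

# With delta = 0 the matching recovers chi^2_4 exactly.
print(zhang_match(0.0))           # (1.0, 4.0, 0.0)
```

For any Δ the matched triple reproduces κ_2 and κ_3 of W by construction, which is easy to verify directly.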
When W(λ̄_1, λ̄_2) is compared with the χ²_4 distribution, we see that in the case of the normal copula mainly the left tails differ when the dependence becomes strong. In the case of the Clayton copula both tails of these two distributions are significantly different when θ is large.
Remark 4 When X_1 or X_2 admits more than three values, we established that the distribution of W coincides with the distribution of a weighted sum of independent chi squares with weights λ_i. As the λ_i are unknown, the distribution of W can alternatively be approximated using a permutation method. This would involve calculation of L permutations of the original sample, with (X_1, X_2) being permuted. Then, as the obtained samples satisfy the null hypothesis H_0, the empirical distribution of the values ÎI_1, ..., ÎI_L will approximate that of W. The main problem with this approach is that, as we are actually interested in the quantiles W_{α/2} and W_{1−α/2}, we would need a large number of permuted samples to approximate them precisely, which is computationally demanding for larger n (for a discussion of this in the context of testing for conditional independence see e.g. Tsamardinos and Borboudakis 2010). A way to circumvent this problem would be to approximate the permutation distribution by a weighted chi square distribution, which would involve only estimation of moments and would require a significantly smaller number of permutations. This problem will be the subject of future research.
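The permutation scheme of this remark can be sketched as follows (the 3 × 3 × 2 shape, the simulated data and the guard against empty cells are illustrative simplifications):

```python
import numpy as np

def ii_plugin(counts):
    """Plug-in estimator of interaction information from a 3-way count table."""
    p = counts / counts.sum()
    p = np.where(p == 0, 1e-12, p)
    p_ij = p.sum(axis=2)[:, :, None]; p_ik = p.sum(axis=1)[:, None, :]
    p_jk = p.sum(axis=0)[None, :, :]
    p_i = p.sum(axis=(1, 2))[:, None, None]; p_j = p.sum(axis=(0, 2))[None, :, None]
    p_k = p.sum(axis=(0, 1))[None, None, :]
    return float(np.sum(p * np.log(p * p_i * p_j * p_k / (p_ij * p_ik * p_jk))))

def perm_quantiles(x1, x2, y, L=500, alpha=0.05, seed=0):
    """Approximate quantiles of 2n*II_hat under H0 by permuting (X1, X2) against Y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(L)
    for l in range(L):
        perm = rng.permutation(n)           # permute the (X1, X2) pairs jointly
        counts = np.zeros((3, 3, 2))
        np.add.at(counts, (x1[perm], x2[perm], y), 1.0)
        stats[l] = 2 * n * ii_plugin(counts)
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(6)
n = 2000
x1 = rng.choice(3, n, p=[0.25, 0.5, 0.25])
x2 = rng.choice(3, n, p=[0.25, 0.5, 0.25])
y = rng.choice(2, n, p=[0.05, 0.95])
lo, hi = perm_quantiles(x1, x2, y)
print(lo, hi)
```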

Dependence Models of (X 1 , X 2 )
In order to check how dependence between X_1 and X_2 affects the asymptotic distribution W and the way it influences actual significance levels of the tests based on ÎI, we investigated discretized versions of four distributions pertaining to popular copulas. A copula C is defined as a bivariate function on the unit square which satisfies

C(F_1(x), F_2(y)) = F(x, y),   (20)

where F is the distribution function corresponding to P^{X_1,X_2} and F_1 and F_2 are the corresponding marginal distribution functions. Here we consider the normal copula together with the Clayton, Gumbel and Frank copulas. They are described in standard reference texts (see e.g. Nelsen 2006); here it suffices to state that they are all parametrized by a parameter θ, which in the case of the discretized normal copula corresponds to the correlation coefficient of the original normal variables. We assume that X_1 and X_2 take one of the values 1, 2 or 3, with P(X_1 = 1) = 0.25, P(X_1 = 2) = 0.5, P(X_1 = 3) = 0.25, and X_2 =_d X_1. Thus the marginals satisfy the Hardy-Weinberg hypothesis with p = q = 0.5. The distribution function given in Eq. 20 is then discretized to atoms (i, j) with i, j = 1, 2, 3, e.g. P(X_1 = 1, X_2 = 1) = C(0.25, 0.25), P(X_1 = 2, X_2 = 1) = C(0.75, 0.25) − C(0.25, 0.25), and so on. The binary response Y is generated independently of (X_1, X_2) and such that P(Y = 0) = 0.05. In genetic applications P(Y = 0) may correspond to the prevalence of a disease. Thus H_0 is satisfied and II = 0.
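The discretization described above can be sketched for the Clayton copula C(u, v) = (u^{−θ} + v^{−θ} − 1)^{−1/θ}; the boundary values C(u, 1) = u, C(1, v) = v and C(u, 0) = C(0, v) = 0 are filled in explicitly, and the atom masses are rectangle probabilities obtained by double differencing:

```python
import numpy as np

def clayton(u, v, theta):
    """Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0."""
    return (u**(-theta) + v**(-theta) - 1.0)**(-1.0 / theta)

def discretize(copula, theta, breaks=(0.0, 0.25, 0.75, 1.0)):
    """3 x 3 table of P(X1 = i, X2 = j) from marginal breaks (0.25, 0.5, 0.25)."""
    F = np.array([[copula(max(u, 1e-12), max(v, 1e-12), theta)
                   for v in breaks] for u in breaks])
    # Enforce the copula boundary conditions exactly.
    F[0, :] = 0.0; F[:, 0] = 0.0
    F[-1, :] = breaks; F[:, -1] = breaks
    # Rectangle probabilities: F[i+1,j+1] - F[i,j+1] - F[i+1,j] + F[i,j].
    return np.diff(np.diff(F, axis=0), axis=1)

p12 = discretize(clayton, 2.0)
print(p12.sum(axis=1))    # marginal of X1, approx (0.25, 0.5, 0.25)
```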

Methodology and Computing in Applied Probability
It turns out that λ̄_1 and λ̄_2 behave very differently as functions of θ for the considered copulas and the two introduced distributions (see Fig. 1). Recall that the difference between their squared values equals λ̄_1² − λ̄_2². We also show in Fig. 2 how the parameter θ influences the dependence between X_1 and X_2 measured by their mutual information. Note that the dependence of X_1 and X_2 corresponds to linkage disequilibrium, which is of interest in genetics.
Moreover, Fig. 3 shows the discrepancies between the true distribution and its two approximations for the quasi-diagonal copula. It is worthwhile to compare this figure with the corresponding panel in Fig. 4 for this distribution to see the influence of the lack of fit on the actual type I error. It can also be seen that the approximation of the empirical distribution by the asymptotic distribution is the least accurate for the strongest dependence between predictors (θ = 0.12, cf. Fig. 2).

Actual Significance Levels
Below we present an analysis of how the actual significance levels of the test based on II differ from the nominal ones when the reference distribution is either W(λ̂₁, λ̂₂) or χ²₄. The figures below are based on L = 1000 repetitions; for each repetition the critical values for the critical region C in Eq. 19 were calculated from the W(λ̂₁, λ̂₂) distribution. The nominal level was set at α = 0.05. It follows from the figures that the proposed test based on the W(λ̂₁, λ̂₂) distribution yields actual significance levels much closer to the nominal one than the test based on the χ²₄ distribution. This is mainly due to the much better approximation of the left tail of the distribution of II, which extends to negative values, by the left tail of W(λ̂₁, λ̂₂). In contrast, the χ²₄ distribution is supported on the positive half of the real line and yields a poor approximation of the lower quantiles of the distribution of II. The upper quantiles of II are moderately well approximated by those of χ²₄, which explains why for the one-sided alternative H₁ : II > 0 the differences between the nominal level based on the χ²₄ approximation and the actual levels are often smaller (cf. Mielniczuk and Rdzanowski 2017). In contrast, for two-sided tests the approximation of both the lower and upper quantiles plays a significant role, and then the test based on the chi square approximation performs poorly. We note that in all the cases considered the actual significance levels for the chi square tests were larger than for the proposed test and larger than the nominal level 0.05. Thus when the power of these tests is compared, this would lead to the erroneous conclusion that the chi square test is more powerful, which is solely due to the lack of control of the significance level. We also note that there are cases when the proposed test, although it performs better than the chi square test, is still much too liberal even for large sample sizes: see e.g. the case of the normal copula for θ = 0.7 and the quasi-diagonal distribution for θ = 0.09. These cases correspond to situations when the dependence between X₁ and X₂ becomes strong, which affects the speed with which the distribution of 2n·II converges to the asymptotic distribution. In general, the control of the significance level is more accurate when the asymptotic distribution is close to the chi square distribution. This includes the cases of weak linkage disequilibrium and the cases when both λ₁ and λ₂ are close to 1. Table 4 shows computation times for testing H₀ for the normal copula with ρ = 0.5.

Fig. 2 Behaviour of I(X₁, X₂) for different copulas

Fig. 3 Quasi-diagonal copula: comparison of the empirical density of 2n·II with the densities of χ²₄ and W(λ̂₁, λ̂₂) for n = 4000 and P(Y = 1) = 0.95
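Construction of the critical region requires the lower and upper quantiles of W. The exact chi square composition of W(λ₁, λ₂) follows the paper's derivation, which we do not restate here; the sketch below is a generic Monte Carlo routine (our own) for quantiles of a weighted sum of centred chi squares, with the weights `lam` and degrees of freedom `df` as placeholder inputs:

```python
import numpy as np

def weighted_centred_chisq_quantiles(lam, df, alpha=0.05, m=200_000, rng=None):
    """Monte Carlo quantiles of W = sum_i lam[i] * (chi2_{df[i]} - df[i]),
    a weighted sum of independent centred chi square variables."""
    rng = np.random.default_rng(rng)
    lam = np.asarray(lam, dtype=float)
    # m draws of each centred chi square summand, combined with the weights
    draws = sum(l * (rng.chisquare(k, size=m) - k) for l, k in zip(lam, df))
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])

lo, hi = weighted_centred_chisq_quantiles([0.8, 0.3], [2, 2])
```

With `lam = [1.0]` and `df = [4]` this reduces to the quantiles of a centred χ²₄ variable, which makes the contrast with the χ²₄ reference used above easy to inspect.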
Note that although the calculation of the W(λ̂₁, λ̂₂) test takes from 35 to 3 times longer than that of the χ²₄ test, depending on the sample size, the times for χ²₄ and Z(Δ̂), which is an approximation of W(λ̂₁, λ̂₂), are comparable. We also note that the ratio of computing times for the proposed test and χ²₄ diminishes with the sample size and is less than 3 for n = 10⁵. Our experiments also indicate that the times depend only insignificantly on the dependence of the two predictors.

Analysis of a Real-World Data Set
We perform an analysis of a real data set on pancreatic cancer considered in Tan et al. (2008), downloaded from the address (SNPsyn 2011). The data consist of 208 observations (121 cases (Y = 1) and 87 controls (Y = 0)) with values of 901 SNPs. We chose predictors with at least two values (there are 499 such predictors in the data set) and consider all K = 499 · 498/2 = 124251 pairs. We applied all three discussed two-sided tests for all pairs and α = 0.05 with the Bonferroni correction, resulting in an individual significance level of 0.05/K = 4.02 × 10⁻⁷. It turns out that the W(λ̂₁, λ̂₂) test detects 77 significant interactions, whereas the chi square test detected 24283 and the Z(Δ̂) test 21289 interactions. The much larger number of detected interactions in the last two cases stems from the fact that the majority of the discovered interactions are negative, and due to the positive support of the chi square distribution their values were considered significant in the case of the χ²₄ test. A similar phenomenon occurs for Z(Δ̂). Thus the majority of the negative interactions discovered by those two tests is likely to be spurious. This suggests that using W(λ̂₁, λ̂₂) for negative interactions will have a much smaller false discovery rate. Table 5 shows the 10 most significant pairs (the last column gives the adjusted p-value times 10⁴), with 7 of them being negative. Note that the most significant pair has a negative interaction. The three highest-ranked pairs with positive interactions occupy the first three places in the ranking with respect to p-values when the one-sided alternative II > 0 is considered.
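The scan over all SNP pairs with the Bonferroni correction can be organized as below (a generic skeleton of our own; `pvalue_fn` stands in for any of the three two-sided tests):

```python
from itertools import combinations

def pairwise_scan(snps, y, pvalue_fn, alpha=0.05):
    """Test all SNP pairs; a pair is reported when its two-sided p-value
    falls below the Bonferroni threshold alpha / K, K = number of pairs."""
    names = sorted(snps)                      # snps: dict name -> genotype codes
    K = len(names) * (len(names) - 1) // 2
    hits = []
    for a, b in combinations(names, 2):
        p = pvalue_fn(snps[a], snps[b], y)
        if p < alpha / K:
            hits.append((a, b, min(1.0, p * K)))   # Bonferroni-adjusted p-value
    return sorted(hits, key=lambda t: t[2])
```

For the data above K = 124251, so the per-pair threshold is 0.05/K and the reported adjusted p-values are min(1, p·K).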

Conclusions
We have derived the asymptotic distributions of interaction information for a general trivariate nominal distribution (X₁, X₂, Y), shown that the limit is a weighted chi square distribution, and determined its weights in the case when (X₁, X₂) is independent of binary Y and both X₁ and X₂ have at most three values. We have shown numerically that using quantiles of the asymptotic distribution W with estimated parameters yields actual significance levels consistently much closer to the nominal ones than when quantiles of the χ²₄ distribution are used. This is especially pronounced in the case of the two-sided alternative due to the significant difference between the left tails of χ²₄ and W(λ₁, λ₂). This can lead to a much larger fraction of false rejections than expected when the χ²₄-based test is used. Putting it differently, one may detect many spurious interactions which do not actually exist. On the negative side, the convergence of 2n·II to W(λ₁, λ₂) can be slow even for large n. We observed such behaviour mainly in situations when the dependence between X₁ and X₂ becomes strong. Thus we recommend using the proposed test with critical values based on quantiles of W with estimated parameters when the dependence of the two predictors, measured e.g. by their mutual information, is not too strong. On the computational side, despite the more complicated form of the asymptotic distribution than in the case of overall independence, the construction of a critical region does not pose any significant hurdles.