The main goal of this paper is to propose a novel procedure for testing the positiveness of II. Such a procedure is useful for finding pairs of variables \((X_1,X_2)\) that jointly predict Y even when the main effects are negligible, and for confirming the statistical significance of the detected interaction. We state the following proposition which, albeit simple, is instrumental for understanding the presented approach.
Proposition 1
If \(Y\perp (X_1,X_2)\), then \(II(X_1,X_2,Y)=0\).
Proof
The independence of \((X_1,X_2)\) and Y implies that \(I((X_1,X_2),Y)=0\) and also \(I(X_1,Y)=I(X_2,Y)=0\). Thus, the assertion follows directly from (1).
Note that although the converse of Proposition 1 is not true, i.e. it is possible to have \(II(X_1,X_2,Y)=0\) while \(I((X_1,X_2),Y)>0\) ([19], p. 121), such examples require special constructions and are not typical. Moreover, it follows from (1) that when \(X_1\) and \(X_2\) are individually independent of Y and \(II(X_1,X_2,Y)=0\), then the pair \((X_1,X_2)\) is independent of Y. Hence, from a practical point of view, the hypotheses \(II(X_1,X_2,Y)=0\) and \(I((X_1,X_2),Y)=0\) are approximately equivalent.
Our principal aim is to test the null hypothesis:
$$\begin{aligned} \text {H}_0: II(X_1,X_2,Y)=0, \end{aligned}$$
(3)
against the alternative hypothesis corresponding to the positiveness of \(II(X_1,X_2,Y)\):
$$\begin{aligned} \text {H}_1: II(X_1,X_2,Y)>0. \end{aligned}$$
(4)
In view of the above discussion we replace \(H_0\) by:
$$ {\tilde{H}}_0: Y\perp (X_1,X_2). $$
The main operational reason for replacing \(H_0\) by \({\tilde{H}}_0\) is that the distribution of a sample version of II under the null hypothesis \(H_0\) is unknown and determining it remains an open problem. We note that the sample versions of \(I(X_1,X_2)\), \(I(X_1,X_2|Y)\) and \(II(X_1,X_2,Y)\) are obtained simply by replacing the true probabilities with estimated probabilities (i.e. sample fractions). They will be denoted by \(\widehat{I}(X_1,X_2)\), \(\widehat{I}(X_1,X_2|Y)\) and \(\widehat{II}(X_1,X_2,Y)\), respectively. In contrast to the \(H_0\) scenario, it is possible to determine the distribution of \(\widehat{II}(X_1,X_2,Y)\) when \({\tilde{H}}_0\) is true using a permutation-based approach. The latter allows one to calculate the distribution of \(\widehat{II}(X_1,X_2,Y)\) with arbitrary accuracy for any sample size n and for a fixed sample distribution of Y and \((X_1,X_2)\), whereas the chi-square approximation, even when it is valid, is accurate only for large sample sizes. In this paper we combine these two approaches: the permutation approach and the one based on the asymptotic distribution. This yields a novel testing method which is computationally feasible (it is not as computationally intensive as the permutation-based test) and more powerful than the chi-squared test.
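As an illustration, the plug-in estimators can be computed directly from integer-coded samples. The following Python sketch is ours, not the paper's implementation; it uses the standard form \(II(X_1,X_2,Y)=I(X_1,X_2|Y)-I(X_1,X_2)\), which is equivalent to the decomposition referenced in the proof of Proposition 1.

```python
# Minimal sketch of the plug-in estimators; samples are 1-D integer-coded
# numpy arrays. Function names and data layout are illustrative assumptions.
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X,Y): true probabilities replaced by fractions."""
    mi = 0.0
    for vx in np.unique(x):
        for vy in np.unique(y):
            p_xy = np.mean((x == vx) & (y == vy))
            if p_xy > 0:
                p_x = np.mean(x == vx)
                p_y = np.mean(y == vy)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def conditional_mutual_information(x1, x2, y):
    """Plug-in estimate of I(X1,X2|Y): average of per-slice MI, weighted by P(Y=y)."""
    cmi = 0.0
    for vy in np.unique(y):
        mask = (y == vy)
        cmi += np.mean(mask) * mutual_information(x1[mask], x2[mask])
    return cmi

def interaction_information(x1, x2, y):
    """Plug-in estimate of II(X1,X2,Y) = I(X1,X2|Y) - I(X1,X2)."""
    return conditional_mutual_information(x1, x2, y) - mutual_information(x1, x2)
```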
3.1 Chi-Squared Test IICHI
The distribution of \(\widehat{II}(X_1,X_2,Y)\) under the null hypothesis (3) is not known. However, in a special case of (3), when all three variables are jointly independent and all probabilities \(P(X_1=x_i,X_2=x_j,Y=y_k)\) are positive, Han [16] has shown that for a large sample size
$$\begin{aligned} 2n\widehat{II}(X_1,X_2,Y)\sim \chi ^2_{(K_1-1)(K_2-1)(L-1)}, \end{aligned}$$
(5)
approximately, where \(K_1, K_2, L\) are the numbers of levels of \(X_1\), \(X_2\) and Y, respectively. Of course, joint independence of \((X_1,X_2,Y)\) is only a special case of \({\tilde{H}}_0\), which is in turn a special case of (3). Nonetheless, the above approximation is informally used to test the positiveness of II under the null hypothesis, see e.g. [12]. Thus, for this method, we accept the null hypothesis (3) if \(2n\widehat{II}(X_1,X_2,Y)<\chi ^{2}_{(K_1-1)(K_2-1)(L-1),1-\alpha }\), where \(\alpha \) is the significance level. It turns out that as the dependence between \(X_1\) and \(X_2\) increases, the distribution of \(2n\widehat{II}(X_1,X_2,Y)\) deviates from the \(\chi ^2\) distribution. Thus the \(\chi ^2\) test can be used to test the positiveness of \(II(X_1,X_2,Y)\) only if the dependence between \(X_1\) and \(X_2\) is weak. It follows from our experiments that if the dependence between \(X_1\) and \(X_2\) is strong, then the \(\chi ^2\) test tends to reject the null hypothesis too often, i.e. its type I error rate may significantly exceed the prescribed significance level \(\alpha \).
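The decision rule can be sketched as follows, assuming the plug-in estimator from the sketch above (the function name ii_chi_test is ours):

```python
# Minimal sketch of the IICHI decision rule based on approximation (5).
import numpy as np
from scipy.stats import chi2

def ii_chi_test(x1, x2, y, alpha=0.05):
    """Reject H0: II = 0 when 2n * II_hat exceeds the chi^2 quantile in (5)."""
    n = len(y)
    k1, k2, l = len(np.unique(x1)), len(np.unique(x2)), len(np.unique(y))
    df = (k1 - 1) * (k2 - 1) * (l - 1)
    stat = 2 * n * interaction_information(x1, x2, y)
    return stat > chi2.ppf(1 - alpha, df)  # True = reject the null hypothesis
```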
3.2 Permutation Test IIPERM
The distribution of \(2n\widehat{II}(X_1,X_2,Y)\) under the null hypothesis \({\tilde{H}}_0\) can be approximated using a permutation test. Although \({\tilde{H}}_0\) is a proper subset of (3), the Monte-Carlo approximation is used to test the positiveness of II under hypothesis (3). Observe that by permuting the values of the variable Y while keeping the values of \((X_1,X_2)\) fixed, we obtain a sample conforming to the null distribution. An important advantage of the permutation test is that permuting the values of Y preserves the dependence between \(X_1\) and \(X_2\). We permute the values of the variable Y and calculate \(2n\widehat{II}(X_1,X_2,Y)\) using the resulting data. This step is repeated B times, which allows us to approximate the distribution of \(2n\widehat{II}(X_1,X_2,Y)\) under the null hypothesis \({\tilde{H}}_0\).
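A minimal sketch of this procedure, again assuming the plug-in estimator defined earlier (the Monte-Carlo p-value below is an equivalent formulation of comparing the statistic to the permutation quantile):

```python
# Minimal sketch of IIPERM. Permuting y leaves the joint sample
# distribution of (x1, x2) intact, which is the key advantage noted above.
import numpy as np

def ii_perm_test(x1, x2, y, alpha=0.05, B=1000, rng=None):
    """Monte-Carlo test of H0~: Y independent of (X1, X2), via B permutations."""
    rng = rng or np.random.default_rng()
    n = len(y)
    stat = 2 * n * interaction_information(x1, x2, y)
    null_stats = np.array([
        2 * n * interaction_information(x1, x2, rng.permutation(y))
        for _ in range(B)
    ])
    # Standard +1 correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null_stats >= stat)) / (B + 1)
    return p_value < alpha  # True = reject the null hypothesis
```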
Figure 1 shows the permutation distribution (for \(B=10000\)), the \(\chi ^2\) distribution and the true distribution of \(2n\widehat{II}(X_1,X_2,Y)\) under the null hypothesis (3) for artificial data M0 (see Sect. 4.1), generated as follows. The pair \((X_1,X_2)\) is drawn from the distribution described in Table 1; Y is generated independently from the standard Gaussian distribution and then discretized into 5 equal-frequency bins. The true distribution is approximated by calculating \(2n\widehat{II}(X_1,X_2,Y)\) over 10000 data generation repetitions (this is possible only for artificially generated data). Since \(X_1\) and \(X_2\) each take 3 possible values, we consider the \(\chi ^2\) distribution with \((3-1)\times (3-1)\times (5-1)=16\) degrees of freedom. In this experiment we control the strength of the dependence between \(X_1\) and \(X_2\) and analyse three cases: \(I(X_1,X_2)=0\), \(I(X_1,X_2)=0.27\) and \(I(X_1,X_2)=0.71\). Thus in the first case \(X_1\) and \(X_2\) are independent, whereas in the last case there is a strong dependence between \(X_1\) and \(X_2\).

First, observe that the lines corresponding to the permutation distribution and the true distribution are practically indistinguishable, which indicates that the permutation distribution approximates the true distribution very well. Secondly, it is clearly seen that the \(\chi ^2\) distribution deviates from the remaining ones when the dependence between \(X_1\) and \(X_2\) becomes large. Although this nicely illustrates (5) in the case of complete independence, it also underlines that the \(\chi ^2\) approximation is too crude when the dependence between \(X_1\) and \(X_2\) is strong. The right tail of the \(\chi ^2\) distribution is thinner than the right tail of the true distribution, and thus the uppermost quantiles of the true distribution are underestimated by the corresponding quantiles of \(\chi ^2\) (Fig. 1). This is the reason why IICHI rejects the null hypothesis too often, leading to many false positives. This problem has been recognized in other scenarios of interaction detection (cf. [20]). The drawback of the permutation test is its computational cost, which becomes a serious problem when the procedure is applied to thousands of variables, as in the analysis of SNPs.
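For completeness, the M0-style generation scheme can be sketched as follows. The concrete joint probability table from Table 1 is not reproduced here, so the sketch takes an arbitrary table p_x1x2 as input (a hypothetical parameter of ours):

```python
# Minimal sketch of M0-style data generation: (X1, X2) drawn from a supplied
# joint table, Y independent N(0,1) discretized into equal-frequency bins.
import numpy as np

def generate_m0(p_x1x2, n, n_bins=5, rng=None):
    rng = rng or np.random.default_rng()
    k1, k2 = p_x1x2.shape
    # Draw (X1, X2) jointly; p_x1x2 is assumed to sum to 1.
    flat = rng.choice(k1 * k2, size=n, p=p_x1x2.ravel())
    x1, x2 = flat // k2, flat % k2
    # Y independent of (X1, X2): standard Gaussian, then equal-frequency bins.
    y_cont = rng.standard_normal(n)
    cuts = np.quantile(y_cont, np.linspace(0, 1, n_bins + 1)[1:-1])
    y = np.digitize(y_cont, cuts)
    return x1, x2, y
```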
3.3 Hybrid Test
To overcome the drawbacks of the \(\chi ^2\) test (a significant deviation from the true distribution under the null hypothesis) and the permutation test (a high computational cost), we propose a hybrid procedure that combines the two approaches and exploits the advantages of both methods. It consists of two steps. We first verify whether a dependence between \(X_1\) and \(X_2\) exists, using a test for the null hypothesis
$$\begin{aligned} \text {H}_0: I(X_1,X_2)=0, \end{aligned}$$
(6)
where the alternative hypothesis corresponds to the positiveness of MI:
$$\begin{aligned} \text {H}_1: I(X_1,X_2)>0. \end{aligned}$$
(7)
It is known (cf. e.g. [21]) that under the null hypothesis (6), we approximately have:
$$ 2n\widehat{I}(X_1,X_2)\sim \chi ^{2}_{(K_1-1)(K_2-1)}, $$
for large sample sizes. If the null hypothesis (6) is not rejected, we apply the chi-squared test for \(II(X_1,X_2,Y)\) described in Sect. 3.1. Otherwise we use the permutation test described in Sect. 3.2. In the case of independence (or weak dependence) of \(X_1\) and \(X_2\) we perform the permutation test rarely or not at all, which reduces the computational effort of the procedure. There are three input parameters. Parameter \(\alpha \) is the nominal significance level of the test for interactions. Parameter \(\alpha _0\) is the significance level of the initial test for independence between \(X_1\) and \(X_2\). The larger the value of \(\alpha _0\), the more likely the null hypothesis (6) is to be rejected, and thus the more likely the permutation test is to be used. Choosing a small value of \(\alpha _0\) leads to more frequent use of the chi-squared test. This reduces the computational burden associated with the permutation test, but can be misleading when the chi-squared distribution deviates from the true distribution of \(2n\widehat{II}(X_1,X_2,Y)\) under the null hypothesis (3). Parameter B is the number of permutations in the permutation test. The larger the value of B, the more accurate the approximation of the distribution of \(2n\widehat{II}(X_1,X_2,Y)\) under the null hypothesis; on the other hand, choosing a large B increases the computational burden. The algorithm for the HYBRID method is given below.
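The two-step logic can be summarized in a short Python sketch. This is a minimal illustration built on the earlier sketches (mutual_information, ii_chi_test, ii_perm_test), not the paper's exact pseudocode:

```python
# Minimal sketch of HYBRID: screen for dependence between X1 and X2 first,
# and fall back to the costly permutation test only when they appear dependent.
import numpy as np
from scipy.stats import chi2

def hybrid_test(x1, x2, y, alpha=0.05, alpha0=0.05, B=1000):
    n = len(y)
    k1, k2 = len(np.unique(x1)), len(np.unique(x2))
    # Step 1: chi^2 test of H0: I(X1,X2)=0, cf. (6).
    mi_stat = 2 * n * mutual_information(x1, x2)
    if mi_stat < chi2.ppf(1 - alpha0, (k1 - 1) * (k2 - 1)):
        # X1 and X2 look independent: the cheap approximation (5) is reliable.
        return ii_chi_test(x1, x2, y, alpha)
    # Otherwise the chi^2 approximation may be too crude: use permutations.
    return ii_perm_test(x1, x2, y, alpha, B)
```

Note how \(\alpha_0\) steers the trade-off discussed above: a larger \(\alpha_0\) routes more pairs to the permutation branch, while a smaller \(\alpha_0\) favors the cheaper chi-squared branch.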