## Abstract

Interaction information is one of the most promising measures of interaction strength and has many desirable properties. However, its use for interaction detection has been hindered by the fact that, apart from the simple case of overall independence, the asymptotic distribution of its estimate has not been known. In this paper we provide the asymptotic distributions of its empirical versions, which are needed for formal testing of interactions. We prove that, for a three-dimensional nominal vector, the normalized empirical interaction information converges to the normal law unless the distribution coincides with its Kirkwood approximation. In the opposite case, the convergence is to the distribution of a weighted sum of centred chi-square random variables. This case is of special importance, as it roughly corresponds to interaction information being zero, and the asymptotic distribution can be used to construct formal tests for interaction detection. The result generalizes the result of Han (Inf Control 46(1):26–45, 1980) for the case when all coordinate random variables are independent. The derivation relies on studying the structure of the covariance matrix of the asymptotic distribution and its eigenvalues. For the case of a 3 × 3 × 2 contingency table, corresponding to the study of two interacting single nucleotide polymorphisms (SNPs) for prediction of a binary outcome, we provide a complete description of the asymptotic law and construct approximate critical regions for testing of interactions when the two SNPs are possibly dependent. We show in numerical experiments that the test based on the derived asymptotic distribution is easy to implement and yields actual significance levels consistently closer to the nominal ones than the test based on the chi-square reference distribution.
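The quantities named in the abstract can be illustrated numerically. The sketch below is not the authors' implementation. It assumes the common sign convention II(X1; X2; Y) = I((X1, X2); Y) − I(X1; Y) − I(X2; Y) and the usual form of the Kirkwood approximation, p12 · p1y · p2y / (p1 · p2 · py). The function names `mutual_information`, `interaction_information` and `kirkwood_approximation` are hypothetical and chosen only for this example.

```python
import numpy as np


def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0  # cells with zero mass contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))


def interaction_information(counts):
    """Plug-in interaction information of a 3-way table indexed (X1, X2, Y).

    Assumed convention: II = I((X1, X2); Y) - I(X1; Y) - I(X2; Y).
    """
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()  # empirical cell probabilities
    n1, n2, ny = p.shape
    i_joint = mutual_information(p.reshape(n1 * n2, ny))  # I((X1, X2); Y)
    i_1 = mutual_information(p.sum(axis=1))               # I(X1; Y)
    i_2 = mutual_information(p.sum(axis=0))               # I(X2; Y)
    return i_joint - i_1 - i_2


def kirkwood_approximation(p):
    """Kirkwood approximation of a 3-way probability table.

    Assumes all one-dimensional marginals are strictly positive.
    The result need not sum to one.
    """
    p = np.asarray(p, dtype=float)
    p12 = p.sum(axis=2)   # joint of (X1, X2)
    p1y = p.sum(axis=1)   # joint of (X1, Y)
    p2y = p.sum(axis=0)   # joint of (X2, Y)
    p1 = p12.sum(axis=1)
    p2 = p12.sum(axis=0)
    py = p1y.sum(axis=0)
    num = p12[:, :, None] * p1y[:, None, :] * p2y[None, :, :]
    den = p1[:, None, None] * p2[None, :, None] * py[None, None, :]
    return num / den


# Illustrative 2 x 2 x 2 table: Y = X1 XOR X2 with equal counts in each cell.
xor_counts = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        xor_counts[a, b, a ^ b] = 25
ii_xor = interaction_information(xor_counts)
```

In the XOR table neither variable alone carries information about Y, yet the pair determines Y, so the interaction information is positive. For a fully independent table the distribution equals its Kirkwood approximation and the interaction information is zero, which is the degenerate case the abstract singles out.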

## References

Agresti A (2003) Categorical data analysis. Wiley, New York

Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66

Chanda P, et al. (2008) Ambience: a novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes. Genetics 180:1191–1210

Cordell HJ (2002) Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Hum Mol Genet 11(20):2463–2468

Cordell HJ (2009) Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 10(20):392–404

Darroch J (1974) Multiplicative and additive interaction in contingency tables. Biometrika 9:207–214

Duggal P, Gillanders E, Holmes T, Bailey-Wilson J (2008) Establishing an adjusted p-value threshold to control the family-wide type 1 error in genome wide association studies. BMC Genomics 9:516–613

Fano F (1961) Transmission of information: statistical theory of communication. MIT Press, Cambridge

Han TS (1980) Multiple mutual informations and multiple interactions in frequency data. Inf Control 46(1):26–45

Lin D, Tang X (2006) Conditional infomax learning: an integrated framework for feature extraction and fusion. European Conference on Computer Vision

Matsuda H (2000) Physical nature of higher-order mutual information: intrinsic correlations and frustration. Phys Rev E - Stat Phys Plasmas Fluids Related Interdiscip Topics 62(3 A):3096–3102

McGill WJ (1954) Multivariate information transmission. Psychometrika 19(2):97–116

Meyer P, Schretter C, Bontempi G (2008) Information-theoretic feature selection in microarray data using variable complementarity. IEEE J Sel Top Signal Process 2:261–274

Mielniczuk J, Rdzanowski M (2017) Use of information measures and their approximations to detect predictive gene-gene interaction. Entropy 19:1–23

Mielniczuk J, Teisseyre P (2018) A deeper look at two concepts of measuring gene-gene interactions: logistic regression and interaction information revisited. Genet Epidemiol 42(2):187–200

Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC (2006) A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol 241(2):256–261

Nelsen R (2006) An introduction to copulas, 2nd edn. Springer, London

Schott J (1997) Matrix analysis for statistics. Wiley Series in Probability and Statistics. Wiley, New York

SNPsyn (2011) Data set GSE8054. http://snpsyn.biolab.si/examples/gse8054.tab.gz. Accessed 29 August 2019

Sucheston L, Chanda P, Zhang A, Tritchler D, Ramanathan M (2010) Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity. BMC Genomics 11:1–12

Tan A, Fan J, Karikari C, et al. (2008) Allele-specific expression in the germline of patients with familial pancreatic cancer: an unbiased approach to cancer gene discovery. Cancer Biol Ther 7:135–144

Tsamardinos I, Borboudakis G (2010) Permutation testing improves on Bayesian network learning. In: Proceedings of ECML PKDD 2010, pp 322–337

Wan X, Yang C, Yang Q, Xue T, Fan X, Tang N, Yu W (2010) Boost: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Hum Genet 87(3):325–340

Zhang JT (2005) Approximate and asymptotic distributions of chi-squared type mixtures with applications. J Am Stat Assoc 100(469):273–285

## Acknowledgements

We are grateful to Łukasz Smaga for pointing out Zhang (2005) to us.

## Appendix

Below we prove Lemma 7.

*Proof*

From the previous lemma we know that \(\lambda_{1}(\mathbf{D}) = \ldots = \lambda_{4}(\mathbf{D}) = 1\) and \(\lambda_{9}(\mathbf{D}) = 0\).

By lengthy calculations we obtain a sequence of the following equalities:

where \(\mathbf {Z}=(Z_{ij}^{i^{\prime }j^{\prime }})\) and:

Note that, for example, \(\text{tr}(\mathbf{D}) = (n_{X_{1}}-1)(n_{X_{2}}-1)\) follows from Lemmas 3-5: by Lemma 4, \(\text{tr}(\mathbf{M}) = -\text{tr}(\mathbf{C}) \cdot \text{tr}(\mathbf{D})\), while \(\text{tr}(\mathbf{M}) = (n_{X_{1}}-1)(n_{X_{2}}-1)(n_{Y}-1)\) and \(\text{tr}(\mathbf{C}) = 1 - n_{Y}\). As the trace of a square matrix is the sum of its eigenvalues, it follows from the above equalities that:

From the Newton-Girard identities we obtain:

This means that \(\lambda_{5}(\mathbf{D}), \lambda_{6}(\mathbf{D}), \lambda_{7}(\mathbf{D}), \lambda_{8}(\mathbf{D})\) are the roots of the polynomial:

Note that, in view of its definition, \(H_{1} \geq 0\), and if \(X_{1}\) and \(X_{2}\) are independent then \(H_{1} = H_{2} = \Delta = 0\). From this the lemma follows. □
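The Newton-Girard step used in the proof can be sketched numerically. The code below is only an illustration and does not reproduce the authors' closed-form expressions. It converts the power sums of the four unknown eigenvalues, obtained as \(\text{tr}(\mathbf{D}^{k}) - 4\) because four eigenvalues equal 1 and one equals 0, into the coefficients of the monic quartic whose roots they are. The function name `poly_from_power_sums` and the 9 × 9 test matrix with eigenvalues ±0.5 and ±0.2 are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def poly_from_power_sums(power_sums):
    """Coefficients (highest degree first) of the monic polynomial whose
    roots have power sums p_1, ..., p_m, via the Newton-Girard identities
    k * e_k = sum_{i=1..k} (-1)^(i-1) * e_{k-i} * p_i.
    """
    m = len(power_sums)
    p = [0.0] + [float(s) for s in power_sums]  # shift to 1-based indexing
    e = [1.0] + [0.0] * m                       # elementary symmetric polys, e_0 = 1
    for k in range(1, m + 1):
        acc = 0.0
        for i in range(1, k + 1):
            acc += (-1) ** (i - 1) * e[k - i] * p[i]
        e[k] = acc / k
    # prod_j (x - r_j) = sum_k (-1)^k e_k x^(m-k)
    return [(-1) ** k * e[k] for k in range(m + 1)]


# Hypothetical symmetric 9 x 9 matrix with the eigenvalue pattern of the lemma:
# four eigenvalues equal to 1, one equal to 0, and four "unknown" ones.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((9, 9)))  # random orthogonal basis
eigs = np.array([1.0, 1.0, 1.0, 1.0, 0.5, -0.5, 0.2, -0.2, 0.0])
D = q @ np.diag(eigs) @ q.T

# Power sums of the four unknown eigenvalues: tr(D^k) minus the four unit
# eigenvalues (the zero eigenvalue contributes nothing).
power_sums = [np.trace(np.linalg.matrix_power(D, k)) - 4.0 for k in range(1, 5)]
coeffs = poly_from_power_sums(power_sums)
recovered = np.sort(np.roots(coeffs).real)
```

Here `coeffs` approximates the coefficients of (x² − 0.25)(x² − 0.04), and `recovered` returns the four planted eigenvalues, which mirrors how the traces of powers of **D** determine the polynomial in the proof.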

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Kubkowski, M., Mielniczuk, J. Asymptotic Distributions of Empirical Interaction Information. *Methodol Comput Appl Probab* **23**, 291–315 (2021). https://doi.org/10.1007/s11009-020-09783-0

### Keywords

- Interaction information
- Asymptotic weighted chi square distribution
- Linkage disequilibrium
- Interaction detection

### Mathematics Subject Classification (2010)

- Primary 62G20
- Secondary 60E99