Abstract
Interaction information is one of the most promising measures of interaction strength and has many desirable properties. However, its use for interaction detection has been hindered by the fact that, apart from the simple case of overall independence, the asymptotic distribution of its estimate has not been known. In this paper we provide the asymptotic distributions of its empirical versions, which are needed for formal testing of interactions. We prove that, for a three-dimensional nominal vector, the normalized empirical interaction information converges to the normal law unless the distribution coincides with its Kirkwood approximation. In the opposite case, the convergence is to the distribution of a weighted sum of centred chi-square random variables. This case is of special importance, as it roughly corresponds to interaction information being zero, and the asymptotic distribution can be used to construct formal tests for interaction detection. The result generalizes the result of Han (Inf Control 46(1):26–45, 1980) for the case when all coordinate random variables are independent. The derivation relies on studying the structure of the covariance matrix of the asymptotic distribution and its eigenvalues. For the case of a 3 × 3 × 2 contingency table, corresponding to the study of two interacting Single Nucleotide Polymorphisms (SNPs) for prediction of a binary outcome, we provide a complete description of the asymptotic law and construct approximate critical regions for testing of interactions when the two SNPs are possibly dependent. We show in numerical experiments that the test based on the derived asymptotic distribution is easy to implement and yields actual significance levels consistently closer to the nominal ones than the test based on the chi-square reference distribution.
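As a minimal sketch of the quantity studied here, the plug-in (empirical) interaction information of a three-way contingency table can be computed from the entropies of its marginal tables. The sign convention \(II(X_1;X_2;Y) = I((X_1,X_2);Y) - I(X_1;Y) - I(X_2;Y)\), the function names and the example tables are assumptions made for illustration; they are not code from the paper.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a probability array; empty cells contribute 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def interaction_information(counts):
    """Plug-in interaction information of a table of counts indexed by (x1, x2, y).

    Assumed convention: II = I((X1, X2); Y) - I(X1; Y) - I(X2; Y), which equals
    -H(X1,X2,Y) + H(X1,X2) + H(X1,Y) + H(X2,Y) - H(X1) - H(X2) - H(Y).
    """
    counts = np.asarray(counts, dtype=float)
    if counts.ndim != 3 or counts.sum() <= 0:
        raise ValueError("expected a non-empty three-way table of counts")
    p = counts / counts.sum()  # empirical joint distribution
    h123 = entropy(p)
    h12 = entropy(p.sum(axis=2))
    h1y = entropy(p.sum(axis=1))
    h2y = entropy(p.sum(axis=0))
    h1 = entropy(p.sum(axis=(1, 2)))
    h2 = entropy(p.sum(axis=(0, 2)))
    hy = entropy(p.sum(axis=(0, 1)))
    return -h123 + h12 + h1y + h2y - h1 - h2 - hy


# Illustrative table 1: Y = X1 XOR X2 (pure interaction, no marginal effects).
xor_counts = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        xor_counts[a, b, a ^ b] = 1.0

# Illustrative table 2: a 3 x 3 x 2 table with fully independent coordinates.
indep_counts = np.einsum(
    "i,j,k->ijk", [0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.6]
)

ii_xor = interaction_information(xor_counts)      # log(2) under this convention
ii_indep = interaction_information(indep_counts)  # 0 up to rounding error
```

Under the assumed convention, the XOR table attains the value log 2 although neither predictor is marginally informative about Y, whereas the fully independent table gives zero, which is the situation covered by the result of Han (1980) mentioned above.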
References
Agresti A (2003) Categorical data analysis. Wiley, New York
Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66
Chanda P, et al. (2008) Ambience: a novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes. Genetics 180:1191–1210
Cordell HJ (2002) Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Hum Mol Genet 11(20):2463–2468
Cordell HJ (2009) Detecting gene-gene interactions that underlie human diseases. Nat Rev Gen 10(20):392–404
Darroch J (1974) Multiplicative and additive interaction in contingency tables. Biometrika 9:207–214
Duggal P, Gillanders E, Holmes T, Bailey-Wilson J (2008) Establishing an adjusted p-value threshold to control the family-wide type 1 error in genome wide association studies. BMC Genomics 9:516–613
Fano F (1961) Transmission of information: statistical theory of communication. MIT Press, Cambridge
Han TS (1980) Multiple mutual informations and multiple interactions in frequency data. Inf Control 46(1):26–45
Lin D, Tang X (2006) Conditional infomax learning: an integrated framework for feature extraction and fusion. European Conference on Computer Vision
Matsuda H (2000) Physical nature of higher-order mutual information: intrinsic correlations and frustration. Phys Rev E - Stat Phys Plasmas Fluids Related Interdiscip Topics 62(3 A):3096–3102
McGill WJ (1954) Multivariate information transmission. Psychometrika 19(2):97–116
Meyer P, Schretter C, Bontempi G (2008) Information-theoretic feature selection in microarray data using variable complementarity. IEEE J Selected Topics in Signal Process 2:261–274
Mielniczuk J, Rdzanowski M (2017) Use of information measures and their approximations to detect predictive gene-gene interaction. Entropy 19:1–23
Mielniczuk J, Teisseyre P (2018) A deeper look at two concepts of measuring gene-gene interactions: logistic regression and interaction information revisited. Genet Epidemiol 42(2):187–200
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC (2006) A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol 241(2):256–261
Nelsen R (2006) An introduction to copulas, 2nd edn. Springer, London
Schott J (1997) Matrix analysis for statistics. Wiley series in probability and statistics. Wiley, New York
SNPsyn (2011) Data set GSE8054. http://snpsyn.biolab.si/examples/gse8054.tab.gz (date of access: August 29, 2019)
Sucheston L, Chanda P, Zhang A, Tritchler D, Ramanathan M (2010) Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity. BMC Genom 11:1–12
Tan A, Fan J, Karikari C, et al (2008) Allele-specific expression in the germline of patients with familial pancreatic cancer: an unbiased approach to cancer gene discovery. Cancer Biol Ther 7:135–144
Tsamardinos I, Borboudakis G (2010) Permutation testing improves on Bayesian network learning. In: Proceedings of ECML PKDD 2010, pp 322–337
Wan X, Yang C, Yang Q, Xue T, Fan X, Tang N, Yu W (2010) Boost: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Amer J Human Genetics 87(3):325–340
Zhang JT (2005) Approximate and asymptotic distributions of chi-squared type mixtures with applications. J Am Stat Assoc 100(469):273–285
Acknowledgements
We are grateful to Łukasz Smaga for pointing out Zhang (2005) to us.
Appendix
Below we prove Lemma 7.
Proof
From the previous lemma we know that \(\lambda_{1}(\mathbf{D}) = \dots = \lambda_{4}(\mathbf{D}) = 1\) and \(\lambda_{9}(\mathbf{D}) = 0\).
By lengthy calculations we obtain a sequence of the following equalities:
where \(\mathbf {Z}=(Z_{ij}^{i^{\prime }j^{\prime }})\) and:
Note that, e.g., \(\text {tr}(\mathbf {D}) = (n_{X_{1}}-1)(n_{X_{2}}-1)\) follows from Lemmas 3-5: by Lemma 4, \(\text {tr}(\mathbf {M}) = -\text {tr}(\mathbf {C}) \cdot \text {tr}(\mathbf {D})\), while \(\text {tr}(\mathbf {M})=(n_{X_{1}}-1)(n_{X_{2}}-1)(n_{Y}-1)\) and \(\text {tr}(\mathbf {C}) = 1 - n_{Y}\). Since the trace of a square matrix is the sum of its eigenvalues, the above equalities imply that:
From the Newton-Girard identities we obtain:
This means that \(\lambda_{5}(\mathbf{D}), \lambda_{6}(\mathbf{D}), \lambda_{7}(\mathbf{D}), \lambda_{8}(\mathbf{D})\) are the roots of the polynomial:
Note that, in view of its definition, \(H_{1} \geq 0\), and if \(X_{1}\) and \(X_{2}\) are independent, then \(H_{1} = H_{2} = {\Delta} = 0\). From this the lemma follows. □
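The Newton-Girard step used in the proof can be illustrated in generic form: given the power sums \(p_k = \sum_i \lambda_i^k\) of a set of unknown numbers, the identities \(k e_k = \sum_{i=1}^{k} (-1)^{i-1} e_{k-i} p_i\) yield the elementary symmetric polynomials \(e_k\), and hence the monic polynomial whose roots are those numbers. The sketch below is an assumption-laden illustration, not the authors' computation: the function name and the numerical example are hypothetical. In the setting of the lemma, the power sums of the four remaining eigenvalues would presumably be obtained as \(\text{tr}(\mathbf{D}^k) - 4\) for \(k = 1, \dots, 4\), since four of the nine eigenvalues equal 1 and one equals 0.

```python
import numpy as np


def monic_poly_from_power_sums(power_sums):
    """Coefficients (highest degree first) of prod_i (t - lambda_i),
    recovered from the power sums p_k = sum_i lambda_i**k, k = 1..n,
    via the Newton-Girard identities."""
    n = len(power_sums)
    e = [1.0]  # e_0 = 1
    for k in range(1, n + 1):
        # k * e_k = sum_{i=1}^{k} (-1)^(i-1) * e_{k-i} * p_i
        s = sum(
            (-1) ** (i - 1) * e[k - i] * power_sums[i - 1]
            for i in range(1, k + 1)
        )
        e.append(s / k)
    # prod_i (t - lambda_i) = sum_k (-1)^k * e_k * t^(n-k)
    return [(-1) ** k * e[k] for k in range(n + 1)]


# Hypothetical example: four "unknown" values that sum to zero.
true_roots = np.array([2.0, 1.0, -0.5, -2.5])
power_sums = [float(np.sum(true_roots ** k)) for k in range(1, 5)]

coeffs = monic_poly_from_power_sums(power_sums)
recovered = np.sort(np.real(np.roots(coeffs)))
```

Here `recovered` agrees with the sorted `true_roots` up to rounding, which mirrors how four trace identities are enough to determine a quartic with the four remaining eigenvalues as its roots.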
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kubkowski, M., Mielniczuk, J. Asymptotic Distributions of Empirical Interaction Information. Methodol Comput Appl Probab 23, 291–315 (2021). https://doi.org/10.1007/s11009-020-09783-0
DOI: https://doi.org/10.1007/s11009-020-09783-0
Keywords
- Interaction information
- Asymptotic weighted chi square distribution
- Linkage disequilibrium
- Interaction detection
Mathematics Subject Classification (2010)
- Primary 62G20
- Secondary 60E99