Abstract
This paper introduces a technique for estimating mutual information in data sets that contain both discrete and continuous variables. Combined with the Chow–Liu algorithm, our approach constructs a forest that captures the underlying probabilistic dependencies among the variables. Conventional methods maximize the likelihood and, particularly for sequences of the form discrete, Gaussian, discrete, limit the class of permissible forests. Our method overcomes these constraints by accommodating discrete and continuous random variables simultaneously. First, we use copula techniques to estimate the joint density of mixed-type variables. We then apply the Watanabe Bayesian Information Criterion (WBIC) to compute the free energies, from which we estimate the mutual information between discrete and continuous variables. This extends existing mutual information estimation frameworks to mixed-type data. When integrated with the Chow–Liu algorithm, our estimator produces a forest rather than a spanning tree, without restrictive assumptions on its structure. In an application to genome expression analysis, our method links gene expression data to single nucleotide polymorphism (SNP) data.
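As a minimal illustration of the forest-construction step only, the following Python sketch runs a Kruskal-style Chow–Liu procedure on a table of pairwise mutual information estimates and skips every edge whose estimate is not positive, so that the result may be a forest rather than a spanning tree. It is a sketch under stated assumptions, not the implementation used in the paper: the function names `mi_from_free_energies` and `chow_liu_forest`, the rule of dropping non-positive edges, the conversion from free energies to mutual information via I(X;Y) = H(X) + H(Y) - H(X,Y), and all numerical values are hypothetical, and the copula and WBIC computations are not reproduced.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def mi_from_free_energies(f_x: float, f_y: float, f_xy: float, n: int) -> float:
    """Hypothetical helper: mutual information estimate from free energies.

    Assumption: each free energy approximates n times the corresponding
    entropy plus a complexity penalty, so I(X;Y) = H(X) + H(Y) - H(X,Y)
    is estimated by (F_X + F_Y - F_XY) / n. The value may be negative.
    """
    return (f_x + f_y - f_xy) / n


def chow_liu_forest(n_vars: int, mi: Dict[Edge, float]) -> List[Edge]:
    """Kruskal-style Chow-Liu: add edges in decreasing order of estimated
    mutual information, skipping edges that would close a cycle and
    stopping once the estimates are no longer positive."""
    parent = list(range(n_vars))  # union-find structure over variables

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    forest: List[Edge] = []
    for (i, j), w in sorted(mi.items(), key=lambda kv: kv[1], reverse=True):
        if w <= 0:
            break  # all remaining estimates are non-positive
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((i, j))
    return forest


# Toy pairwise estimates for four variables (made-up numbers).
toy_mi = {
    (0, 1): 0.30, (1, 2): 0.20, (0, 2): 0.10,
    (0, 3): -0.02, (1, 3): 0.0, (2, 3): -0.05,
}
forest = chow_liu_forest(4, toy_mi)
print(forest)
```

In this toy input, every estimate involving variable 3 is non-positive, so variable 3 stays isolated and the output has two edges instead of the three a spanning tree would require.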
Data availability
The data that support the findings of this study are available at https://github.com/ash141886/Forest-based-on-WBIC.
Acknowledgements
The first author gratefully acknowledges the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan, for its financial assistance.
Funding
No funding was received for conducting this study.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by Maomi Ueno.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Islam, A., Suzuki, J. Forest construction of Gaussian and discrete variables with the application of Watanabe Bayesian Information Criterion. Behaviormetrika (2024). https://doi.org/10.1007/s41237-024-00227-4
DOI: https://doi.org/10.1007/s41237-024-00227-4