Forest construction of Gaussian and discrete variables with the application of Watanabe Bayesian Information Criterion

  • Original Paper
Behaviormetrika

Abstract

This paper introduces a technique for estimating mutual information in data sets that comprise both discrete and continuous variables. Using the Chow–Liu algorithm, our approach constructs a forest that captures the underlying probabilistic dependencies among the variables. Conventional methods maximize the likelihood and thereby limit the class of permissible forests, particularly for sequences of discrete, Gaussian, and discrete variables. Our methodology overcomes these constraints by accommodating discrete and continuous random variables simultaneously. First, we use copula techniques to estimate the joint density of mixed-type variables. Then, we apply the Watanabe Bayesian Information Criterion (WBIC) to compute the free energies, from which we estimate the mutual information between the discrete and continuous variables. This extends the capabilities of existing mutual information estimation frameworks. When integrated with the Chow–Liu algorithm, our estimator produces, without restrictive assumptions, a forest rather than a mere spanning tree. In genome expression analysis, our method successfully links gene expression data to single nucleotide polymorphism (SNP) data.
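
The forest-building step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the mutual information between two variables is estimated as (F_X + F_Y - F_XY) / n from free energies that have already been computed (for example, approximated by WBIC), and that an edge is added only when this estimate is positive, which is how a Kruskal-style Chow–Liu procedure can return a forest instead of a spanning tree. The function names and all numerical values are hypothetical.

```python
from itertools import combinations


def mi_from_free_energies(f_x, f_y, f_xy, n):
    """Assumed estimator: (F_X + F_Y - F_XY) / n. The value may be negative."""
    return (f_x + f_y - f_xy) / n


def chow_liu_forest(num_vars, mi):
    """Kruskal-style maximum-weight forest.

    mi maps a pair (i, j) with i < j to its estimated mutual information.
    Pairs with a non-positive estimate are never connected (assumption).
    """
    parent = list(range(num_vars))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    # Visit candidate edges from the largest estimate to the smallest.
    for (i, j), w in sorted(mi.items(), key=lambda kv: kv[1], reverse=True):
        if w <= 0:
            break  # every remaining estimate is non-positive
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges


# Hypothetical free energies for four variables and n = 100 samples.
n = 100
single = {0: 150.0, 1: 140.0, 2: 160.0, 3: 155.0}
joint = {
    (0, 1): 270.0, (0, 2): 305.0, (0, 3): 310.0,
    (1, 2): 290.0, (1, 3): 296.0, (2, 3): 316.0,
}
mi = {
    (i, j): mi_from_free_energies(single[i], single[j], joint[(i, j)], n)
    for i, j in combinations(range(4), 2)
}
forest = chow_liu_forest(4, mi)
print(forest)
```

With these made-up inputs, every pair involving variable 3 has a negative estimate, so variable 3 stays isolated and the result has two components rather than one spanning tree.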

Data availability

The data that support the findings of this study are available at https://github.com/ash141886/Forest-based-on-WBIC.

Acknowledgements

The first author gratefully acknowledges the financial assistance of the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.

Funding

No funding was received for conducting this study.

Author information

Corresponding author

Correspondence to Ashraful Islam.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by Maomi Ueno.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Islam, A., Suzuki, J. Forest construction of Gaussian and discrete variables with the application of Watanabe Bayesian Information Criterion. Behaviormetrika (2024). https://doi.org/10.1007/s41237-024-00227-4

  • DOI: https://doi.org/10.1007/s41237-024-00227-4
