
Scoring Bayesian networks of mixed variables

  • Regular Paper
  • Published in: International Journal of Data Science and Analytics

Abstract

In this paper we outline two novel scoring methods for learning Bayesian networks in the presence of both continuous and discrete variables, that is, mixed variables. While much work has been done in the domain of automated Bayesian network learning, few studies have investigated this task in the presence of both continuous and discrete variables while focusing on scalability. Our goal is to provide two novel and scalable scoring functions capable of handling mixed variables. The first method, the Conditional Gaussian (CG) score, provides a highly efficient option. The second method, the Mixed Variable Polynomial (MVP) score, allows for a wider range of modeled relationships, including nonlinearity, but it is slower than CG. Both methods calculate log likelihood and degrees of freedom terms, which are incorporated into a Bayesian Information Criterion (BIC) score. Additionally, we introduce a structure prior for efficient learning of large networks and a simplification in scoring the discrete case which performs well empirically. While the core of this work focuses on applications in the search and score paradigm, we also show how the introduced scoring functions may be readily adapted as conditional independence tests for constraint-based Bayesian network learning algorithms. Lastly, we describe ways to simulate networks of mixed variable types and evaluate our proposed methods on such simulations.


Notes

  1. https://github.com/cmu-phil/tetrad.

References

  1. Anderson, T., Taylor, J.B.: Strong consistency of least squares estimates in normal linear regression. Ann. Stat., pp. 788–790 (1976)

  2. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)

  3. Bøttcher, S.G.: Learning Bayesian networks with mixed variables. Ph.D. thesis, Aalborg University (2004)

  4. Chen, J., Chen, Z.: Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp. 555–574 (2012)

  5. Chickering, D.M.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2002)

  6. Daly, R., Shen, Q., Aitken, S.: Learning Bayesian networks: approaches and issues. Knowl. Eng. Rev. 26(2), 99–157 (2011)

  7. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  8. Heckerman, D., Geiger, D.: Learning Bayesian networks: a unification for discrete and Gaussian domains. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 274–284. Morgan Kaufmann Publishers Inc. (1995)

  9. Hsia, C.Y., Zhu, Y., Lin, C.J.: A study on trust region update rules in Newton methods for large-scale linear classification. In: Asian Conference on Machine Learning, pp. 33–48 (2017)

  10. Huang, T., Peng, H., Zhang, K.: Model selection for Gaussian mixture models. Statistica Sinica 27(1), 147–169 (2017)

  11. Jeffreys, H., Jeffreys, B.: Weierstrass's theorem on approximation by polynomials. In: Methods of Mathematical Physics, pp. 446–448 (1988)

  12. Peters, J., Janzing, D., Schölkopf, B.: Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, Cambridge (2017)

  13. McGeachie, M.J., Chang, H.H., Weiss, S.T.: CGBayesNets: conditional Gaussian Bayesian network learning and inference with mixed discrete and continuous data. PLoS Comput. Biol. 10(6), e1003676 (2014)

  14. Meek, C.: Complete orientation rules for patterns (1995)

  15. Monti, S., Cooper, G.F.: A multivariate discretization method for learning Bayesian networks from mixed data. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 404–413. Morgan Kaufmann Publishers Inc. (1998)

  16. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Burlington (1988)

  17. Raftery, A.E.: Bayesian model selection in social research. Sociol. Methodol., pp. 111–163 (1995)

  18. Ramsey, J., Glymour, M., Sanchez-Romero, R., Glymour, C.: A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. Int. J. Data Sci. Anal., pp. 1–9 (2016)

  19. Ramsey, J., Zhang, J., Spirtes, P.: Adjacency-faithfulness and conservative causal inference. In: Proceedings of Conference on Uncertainty in Artificial Intelligence, pp. 401–408. AUAI Press, Arlington, Virginia (2006)

  20. Ramsey, J.D., Malinsky, D.: Comparing the performance of graphical structure learning algorithms with Tetrad. arXiv preprint arXiv:1607.08110 (2016)

  21. Romero, V., Rumí, R., Salmerón, A.: Learning hybrid Bayesian networks using mixtures of truncated exponentials. Int. J. Approx. Reason. 42(1–2), 54–68 (2006)

  22. Scheines, R., Spirtes, P., Glymour, C., Meek, C., Richardson, T.: The TETRAD project: constraint based aids to causal model specification. Multivar. Behav. Res. 33(1), 65–117 (1998)

  23. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

  24. Sedgewick, A.J., Shi, I., Donovan, R.M., Benos, P.V.: Learning mixed graphical models with separate sparsity parameters and stability-based model selection. BMC Bioinform. 17(Suppl 5), 175 (2016)

  25. Sokolova, E., Groot, P., Claassen, T., Heskes, T.: Causal discovery from databases with discrete and continuous variables. In: European Workshop on Probabilistic Graphical Models, pp. 442–457. Springer (2014)

  26. Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cambridge (2000)

  27. Zaidi, N.A., Webb, G.I.: A fast trust-region Newton method for softmax logistic regression. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 705–713. SIAM (2017)

Acknowledgements

We thank Clark Glymour, Peter Spirtes, Takis Benos, Dimitrios Manatakis, and Vineet Raghu for helpful discussions about the topics in this paper. We also thank the reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Gregory F. Cooper.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Research reported in this publication was supported by Grant U54HG008540 from the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative, by Grant R01LM012087 from the National Library of Medicine, by Grant IIS-1636786 from the National Science Foundation, and by Grant #4100070287 from the Pennsylvania Department of Health (PA DOH). The PA DOH specifically disclaims responsibility for any analyses, interpretations, or conclusions. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the granting agencies.

Appendices

Appendix A

Proposition 1

The Conditional Gaussian Score is score equivalent.

Proof

Let \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) be directed acyclic graphs with Conditional Gaussian scores \(\mathcal {S}_{1}\) and \(\mathcal {S}_{2}\), respectively. Further, let \(\mathcal {G}_{1} \ne \mathcal {G}_{2}\), but let \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) be in the same Markov equivalence class.

Remove all shared local components between \(\mathcal {S}_{1}\) and \(\mathcal {S}_{2}\) and the corresponding edges in \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\). Call the newly pruned scores \(\mathcal {S}_{1}'\) and \(\mathcal {S}_{2}'\) and the newly pruned graphs \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\), respectively. Note that we have removed all initial unshielded colliders common to both \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\). Additionally, it follows from Meek’s rules that \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) must contain no unshielded colliders since any component which could have become an unshielded collider is necessarily shared between graphs \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) and thus pruned [14].

Since \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) are acyclic graphs without any unshielded colliders, we can represent them both as a join tree of cliques. Further, they share the same skeleton because they come from the same Markov equivalence class, and thus, they can be represented by the same join tree of cliques.

It follows from Pearl, Probabilistic Reasoning in Intelligent Systems, Section 3.2.4, Theorem 8, that the distribution encoded by \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) can be written as a product of the distributions of the cliques of \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) divided by a product of the distributions of their intersections [16]. Therefore, when we calculate \(\mathcal {S}_{1}'\) and \(\mathcal {S}_{2}'\), we can use the same ratio of joint distributions to obtain the log likelihood and degrees of freedom terms. Hence \(\mathcal {S}_{1}' = \mathcal {S}_{2}'\), and therefore, after adding back the shared local components, we have \(\mathcal {S}_{1} = \mathcal {S}_{2}\). \(\square \)
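
For concreteness, the factorization invoked from Pearl [16] can be written as follows, where \(C_{1}, \ldots , C_{k}\) denote the cliques of the shared join tree and \(S_{2}, \ldots , S_{k}\) the corresponding clique intersections (this notation is introduced here only for illustration):

$$\begin{aligned} P(\varvec{V}) = \frac{\prod _{i=1}^{k} P(C_{i})}{\prod _{i=2}^{k} P(S_{i})} \end{aligned}$$

Because \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) share the same cliques and intersections, the log likelihood and degrees of freedom computed from this ratio are identical for both graphs.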

Proposition 2

If the approximating polynomial from the Weierstrass Approximation Theorem for the data generating function is a polynomial of degree d, and the maximum-degree polynomial used by MVP is at least d, then the MVP score will be consistent in the large sample limit.

Proof

Let \(\varvec{p}\) be the approximating polynomial(s) from the Weierstrass Approximation Theorem [11]. Assuming the least squares estimate(s) contain the same polynomial degrees as \(\varvec{p}\), the least squares estimate(s) will converge to \(\varvec{p}\) as the number of samples \(n \rightarrow \infty \) [1]. Therefore, by the Weierstrass Approximation Theorem, the least squares estimate(s) will converge to the true data generating function(s). Accordingly, the log likelihood term of MVP will be maximal for any model that has either been correctly specified or overspecified, where by overspecified we mean a model containing all the true parameters and more.

However, for any overspecified model, the parameter penalty will necessarily be larger, and hence the MVP score will be lower for the overspecified model than for the correctly specified one.

Additionally, in the case of an underspecified model, note that the parameter penalty term is of order \(O(\log n)\) while the log likelihood term is of order \(O(n)\). This means that in the large sample limit, when comparing an underspecified model to any model containing the correctly specified model, the MVP score will be lower for the underspecified model, since its log likelihood is not maximal while that of the other is.
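
As a sketch of this comparison, write the score in the generic BIC form used throughout, a log likelihood term minus a degrees-of-freedom penalty scaled by \(\log n\) (the exact penalty constant does not affect the argument). For an underspecified model \(M_{0}\) and a model \(M_{1}\) containing the correctly specified model,

$$\begin{aligned} \mathcal {S}(M_{1}) - \mathcal {S}(M_{0}) = \underbrace{\big [ \ell _{1}(n) - \ell _{0}(n) \big ]}_{O(n)} - \underbrace{\tfrac{1}{2} \log n \, (df_{1} - df_{0})}_{O(\log n)} \rightarrow \infty \quad \text {as } n \rightarrow \infty , \end{aligned}$$

so the likelihood gap dominates the penalty gap in the large sample limit.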

Therefore, the MVP score is consistent. \(\square \)

Proposition 3

The approximated values from least squares regression sum to one.

Proof

Let the terms in the below equation be defined according to Sect. 4.3.

$$\begin{aligned} \sum _{h=1}^{d} \varvec{X}_p \hat{\varvec{\beta }}_{h}&= \varvec{X}_p (\varvec{X}_p^{T} \varvec{X}_p)^{-1} \varvec{X}_p^{T} \sum _{h=1}^{d} \varvec{1}_{\{\varvec{y}_p = h\}} \\&= \varvec{X}_p (\varvec{X}_p^{T} \varvec{X}_p)^{-1} \varvec{X}_p^{T} \varvec{1}_p \\&= \varvec{1}_p \end{aligned}$$

For the last step, we use that \(\varvec{1}_p\) is in the column space of \({\varvec{X}}_{p}\) and is thus projected to itself. \(\square \)
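
As an illustrative numerical check of this identity (not part of the proof), the following Python sketch regresses each category indicator on a design matrix containing an intercept column, so that the all-ones vector lies in its column space; the names X_p, y_p, and d are placeholders mirroring the notation of Sect. 4.3.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3                                   # sample size and number of categories (placeholders)

# Design matrix with an intercept column, so the all-ones vector is in its column space.
X_p = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y_p = rng.integers(1, d + 1, size=n)            # discrete child taking values 1, ..., d

# Least squares fit of each category indicator on X_p, then sum the fitted values.
fitted_sum = np.zeros(n)
for h in range(1, d + 1):
    indicator = (y_p == h).astype(float)        # 1_{y_p = h}
    beta_h, *_ = np.linalg.lstsq(X_p, indicator, rcond=None)
    fitted_sum += X_p @ beta_h

print(np.allclose(fitted_sum, 1.0))             # True: the fitted values sum to one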

Proposition 4

If the approximating polynomial from the Weierstrass Approximation Theorem for the data generating function is a polynomial of degree d, and the maximum-degree polynomial used by MVP is at least d, then the least squares approximations for probability mass functions will be strictly nonnegative in the large sample limit.

Proof

Let f be a generating component of a conditional probability mass function. Since the Weierstrass Approximation Theorem is satisfied by the assumptions of MVP, for every \(\epsilon > 0\) there must exist a polynomial p such that for all \(x \in [a,b]\), we have \(|f(x) - p(x)| < \epsilon \) [11].

For \(x \in [a,b]\) where \(p(x) \ge f(x)\), p(x) is trivially nonnegative since \(f(x) > 0\).

For \(x \in [a,b]\) where \(p(x) < f(x)\), let \(m = f(x)\) and choose \(\epsilon = \frac{m}{2}\). Then,

$$\begin{aligned} |f(x) - p(x)|&< \epsilon \\ f(x) - p(x)&< \epsilon \\ p(x)&> f(x) - \epsilon \\ p(x)&> m - \frac{m}{2} \\ p(x)&> 0 \end{aligned}$$

since \(m > 0\).

Assuming the least squares estimate(s) contain the same polynomial degrees as p, the least squares estimate(s) will converge to p as the number of samples \(n \rightarrow \infty \) [1]. Thus, as the number of samples \(n \rightarrow \infty \), the least squares approximations are strictly nonnegative. \(\square \)

Appendix B

In this appendix, we detail the parameters used to simulate the data. Each parameter is followed by the values we used in simulation and a short description. We split the parameters into three groups: general parameters used across all simulations, parameters specific to the linear simulation, and parameters specific to the nonlinear simulation. Illustrative sketches of how these parameters might be used appear after the corresponding lists below.

1.1 General Parameters

  • numRuns: 10 - number of runs

  • numMeasures: 100, 500 - number of measured variables

  • avgDegree: 2, 4 - average degree of graph

  • sampleSize: 200, 1000 - sample size

  • minCategories: 2 - minimum number of categories

  • maxCategories: 5 - maximum number of categories

  • percentDiscrete: 50 - percentage of discrete variables (0 - 100) for mixed data

  • differentGraphs: true - true if a different graph should be used for each run

  • maxDegree: 5 - maximum degree of the graph

  • maxIndegree: 5 - maximum indegree of graph

  • maxOutdegree: 5 - maximum outdegree of graph

  • coefSymmetric: true - true if negative coefficient values should be considered
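
The general parameters above define a grid of simulation conditions: two settings each for numMeasures, avgDegree, and sampleSize give eight conditions, each run ten times, possibly with a different random graph per run. A minimal sketch of how such a grid might be enumerated follows; the variable names simply mirror the list above, and this is an illustration rather than the Tetrad configuration format.

from itertools import product

num_runs = 10
num_measures = [100, 500]
avg_degree = [2, 4]
sample_sizes = [200, 1000]

# Eight conditions, ten runs each; differentGraphs = true means a new random DAG per run.
for p, deg, n in product(num_measures, avg_degree, sample_sizes):
    for run in range(num_runs):
        # Generate a random DAG with p measured variables and average degree deg,
        # then simulate n samples of mixed data from it.
        pass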

1.2 Linear Parameters

  • varLow: 1 - low end of variance range

  • varHigh: 3 - high end of variance range

  • coefLow: 0.05 - low end of coefficient range

  • coefHigh: 1.5 - high end of coefficient range

  • meanLow: -1 - low end of mean range

  • meanHigh: 1 - high end of mean range
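
A minimal sketch of how a continuous child could be generated under the linear ranges above: coefficients drawn from [coefLow, coefHigh] and sign-flipped at random when coefSymmetric is true, an intercept drawn from [meanLow, meanHigh], and Gaussian noise with variance drawn from [varLow, varHigh]. The function below is illustrative only and ignores discrete parents; it is not the simulator used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def simulate_linear_child(cont_parents, coef_low=0.05, coef_high=1.5,
                          var_low=1.0, var_high=3.0,
                          mean_low=-1.0, mean_high=1.0, coef_symmetric=True):
    # cont_parents: (n, k) array of continuous parent values (k may be zero).
    n, k = cont_parents.shape
    coefs = rng.uniform(coef_low, coef_high, size=k)
    if coef_symmetric:
        coefs *= rng.choice([-1.0, 1.0], size=k)     # allow negative coefficients
    intercept = rng.uniform(mean_low, mean_high)
    variance = rng.uniform(var_low, var_high)
    noise = rng.normal(0.0, np.sqrt(variance), size=n)
    return intercept + cont_parents @ coefs + noise

# Example: a child with two continuous parents and 1000 samples.
parents = rng.normal(size=(1000, 2))
child = simulate_linear_child(parents)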

1.3 Nonlinear Parameters

  • dirichlet: 0.5 - alpha parameter for Dirichlet to draw multinomials

  • interceptLow: 1 - low end of intercept range

  • interceptHigh: 2 - high end of intercept range

  • linearLow: 1.0 - low end of linear coefficient range

  • linearHigh: 2.0 - high end of linear coefficient range

  • quadraticLow: 0.5 - low end of quadratic coefficient range

  • quadraticHigh: 1.0 - high end of quadratic coefficient range

  • cubicLow: 0.2 - low end of cubic coefficient range

  • cubicHigh: 0.3 - high end of cubic coefficient range

  • varLow: 0.5 - low end of variance range

  • varHigh: 0.5 - high end of variance range
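
Analogously, a minimal sketch of the nonlinear case: a continuous child responds to a continuous parent through a polynomial whose intercept, linear, quadratic, and cubic coefficients are drawn from the ranges above, with Gaussian noise of variance 0.5, while the conditional distribution of a discrete child for a given parent configuration could be drawn from a symmetric Dirichlet with the listed alpha. The function names and the single-parent simplification are ours for illustration; this is not the paper's simulator.

import numpy as np

rng = np.random.default_rng(0)

def simulate_polynomial_child(parent, var=0.5):
    # Draw polynomial coefficients from the nonlinear ranges listed above.
    b0 = rng.uniform(1.0, 2.0)      # intercept
    b1 = rng.uniform(1.0, 2.0)      # linear coefficient
    b2 = rng.uniform(0.5, 1.0)      # quadratic coefficient
    b3 = rng.uniform(0.2, 0.3)      # cubic coefficient
    noise = rng.normal(0.0, np.sqrt(var), size=parent.shape)
    return b0 + b1 * parent + b2 * parent**2 + b3 * parent**3 + noise

def draw_conditional_multinomial(n_categories, alpha=0.5):
    # One conditional distribution per discrete parent configuration, Dirichlet(alpha).
    return rng.dirichlet(np.full(n_categories, alpha))

x = rng.normal(size=1000)
y = simulate_polynomial_child(x)
pmf = draw_conditional_multinomial(n_categories=4)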

Cite this article

Andrews, B., Ramsey, J. & Cooper, G.F. Scoring Bayesian networks of mixed variables. Int J Data Sci Anal 6, 3–18 (2018). https://doi.org/10.1007/s41060-017-0085-7
