
Scoring Bayesian networks of mixed variables

  • Regular Paper
  • Published in: International Journal of Data Science and Analytics

Abstract

In this paper we outline two novel scoring methods for learning Bayesian networks in the presence of both continuous and discrete variables, that is, mixed variables. While much work has been done in the domain of automated Bayesian network learning, few studies have investigated this task in the presence of both continuous and discrete variables while focusing on scalability. Our goal is to provide two novel and scalable scoring functions capable of handling mixed variables. The first method, the Conditional Gaussian (CG) score, provides a highly efficient option. The second method, the Mixed Variable Polynomial (MVP) score, allows for a wider range of modeled relationships, including nonlinearity, but it is slower than CG. Both methods calculate log likelihood and degrees of freedom terms, which are incorporated into a Bayesian Information Criterion (BIC) score. Additionally, we introduce a structure prior for efficient learning of large networks and a simplification in scoring the discrete case which performs well empirically. While the core of this work focuses on applications in the search and score paradigm, we also show how the introduced scoring functions may be readily adapted as conditional independence tests for constraint-based Bayesian network learning algorithms. Lastly, we describe ways to simulate networks of mixed variable types and evaluate our proposed methods on such simulations.


Notes

  1. https://github.com/cmu-phil/tetrad.

References

  1. Anderson, T., Taylor, J.B.: Strong consistency of least squares estimates in normal linear regression. Ann. Stat., pp. 788–790 (1976)

  2. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)

  3. Bøttcher, S.G.: Learning Bayesian networks with mixed variables. Ph.D. thesis, Aalborg University (2004)

  4. Chen, J., Chen, Z.: Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp. 555–574 (2012)

  5. Chickering, D.M.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2002)

  6. Daly, R., Shen, Q., Aitken, S.: Learning Bayesian networks: approaches and issues. Knowl. Eng. Rev. 26(2), 99–157 (2011)

  7. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)

  8. Heckerman, D., Geiger, D.: Learning Bayesian networks: a unification for discrete and Gaussian domains. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 274–284. Morgan Kaufmann Publishers Inc. (1995)

  9. Hsia, C.Y., Zhu, Y., Lin, C.J.: A study on trust region update rules in Newton methods for large-scale linear classification. In: Asian Conference on Machine Learning, pp. 33–48 (2017)

  10. Huang, T., Peng, H., Zhang, K.: Model selection for Gaussian mixture models. Statistica Sinica 27(1), 147–169 (2017)

  11. Jeffreys, H., Jeffreys, B.: Weierstrass's theorem on approximation by polynomials. In: Methods of Mathematical Physics, pp. 446–448 (1988)

  12. Peters, J., Janzing, D., Schölkopf, B.: Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, Cambridge (2017)

  13. McGeachie, M.J., Chang, H.H., Weiss, S.T.: CGBayesNets: conditional Gaussian Bayesian network learning and inference with mixed discrete and continuous data. PLoS Comput. Biol. 10(6), e1003676 (2014)

  14. Meek, C.: Complete orientation rules for patterns (1995)

  15. Monti, S., Cooper, G.F.: A multivariate discretization method for learning Bayesian networks from mixed data. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 404–413. Morgan Kaufmann Publishers Inc. (1998)

  16. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Burlington (1988)

  17. Raftery, A.E.: Bayesian model selection in social research. Sociol. Methodol., pp. 111–163 (1995)

  18. Ramsey, J., Glymour, M., Sanchez-Romero, R., Glymour, C.: A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. Int. J. Data Sci. Anal., pp. 1–9 (2016)

  19. Ramsey, J., Zhang, J., Spirtes, P.: Adjacency-faithfulness and conservative causal inference. In: Proceedings of Conference on Uncertainty in Artificial Intelligence, pp. 401–408. AUAI Press, Arlington, Virginia (2006)

  20. Ramsey, J.D., Malinsky, D.: Comparing the performance of graphical structure learning algorithms with Tetrad. arXiv preprint arXiv:1607.08110 (2016)

  21. Romero, V., Rumí, R., Salmerón, A.: Learning hybrid Bayesian networks using mixtures of truncated exponentials. Int. J. Approx. Reason. 42(1–2), 54–68 (2006)

  22. Scheines, R., Spirtes, P., Glymour, C., Meek, C., Richardson, T.: The TETRAD project: constraint based aids to causal model specification. Multivar. Behav. Res. 33(1), 65–117 (1998)

  23. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

  24. Sedgewick, A.J., Shi, I., Donovan, R.M., Benos, P.V.: Learning mixed graphical models with separate sparsity parameters and stability-based model selection. BMC Bioinform. 17(Suppl 5), 175 (2016)

  25. Sokolova, E., Groot, P., Claassen, T., Heskes, T.: Causal discovery from databases with discrete and continuous variables. In: European Workshop on Probabilistic Graphical Models, pp. 442–457. Springer (2014)

  26. Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cambridge (2000)

  27. Zaidi, N.A., Webb, G.I.: A fast trust-region Newton method for softmax logistic regression. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 705–713. SIAM (2017)

Acknowledgements

We thank Clark Glymour, Peter Spirtes, Takis Benos, Dimitrios Manatakis, and Vineet Raghu for helpful discussions about the topics in this paper. We also thank the reviewers for their helpful comments.

Author information

Corresponding author

Correspondence to Gregory F. Cooper.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Research reported in this publication was supported by Grant U54HG008540 from the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative, by Grant R01LM012087 from the National Library of Medicine, by Grant IIS-1636786 from the National Science Foundation, and by Grant #4100070287 from the Pennsylvania Department of Health (PA DOH). The PA DOH specifically disclaims responsibility for any analyses, interpretations, or conclusions. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the granting agencies.

Appendices

Appendix A

Proposition 1

The Conditional Gaussian Score is score equivalent.

Proof

Let \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) be directed acyclic graphs with Conditional Gaussian scores \(\mathcal {S}_{1}\) and \(\mathcal {S}_{2}\), respectively. Further, let \(\mathcal {G}_{1} \ne \mathcal {G}_{2}\), but let \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) be in the same Markov equivalence class.

Remove all shared local components between \(\mathcal {S}_{1}\) and \(\mathcal {S}_{2}\) and the corresponding edges in \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\). Call the newly pruned scores \(\mathcal {S}_{1}'\) and \(\mathcal {S}_{2}'\) and the newly pruned graphs \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\), respectively. Note that we have removed all initial unshielded colliders common to both \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\). Additionally, it follows from Meek’s rules that \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) must contain no unshielded colliders since any component which could have become an unshielded collider is necessarily shared between graphs \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) and thus pruned [14].

Since \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) are acyclic graphs without any unshielded colliders, we can represent them both as a join tree of cliques. Further, they share the same skeleton because they come from the same Markov equivalence class, and thus, they can be represented by the same join tree of cliques.

It follows from Pearl, Probabilistic Reasoning in Intelligent Systems, Section 3.2.4, Theorem 8, that the distribution encoded by \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) can be written as a product of the distributions of the cliques of \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) divided by a product of the distributions of their intersections [16]. Therefore, when we calculate \(\mathcal {S}_{1}'\) and \(\mathcal {S}_{2}'\), we can use the same ratio of joint distributions to obtain the log likelihood and degrees of freedom terms. Hence \(\mathcal {S}_{1}' = \mathcal {S}_{2}'\), and therefore, after adding back the shared local components, we have \(\mathcal {S}_{1} = \mathcal {S}_{2}\). \(\square \)
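
For concreteness, the factorization invoked from Pearl [16] can be written as follows, where \(C_{1}, \ldots , C_{k}\) denote the cliques of the shared join tree and \(S_{2}, \ldots , S_{k}\) the corresponding clique intersections (this notation is introduced here only for illustration):

$$\begin{aligned} P(\varvec{V}) = \frac{\prod _{i=1}^{k} P(C_{i})}{\prod _{i=2}^{k} P(S_{i})} \end{aligned}$$

Because \(\mathcal {G}_{1}'\) and \(\mathcal {G}_{2}'\) share the same cliques and intersections, the log likelihood and degrees of freedom computed from this ratio are identical for both graphs.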

Proposition 2

If the approximating polynomial from the Weierstrass Approximation Theorem for the data generating function is a polynomial of degree d, and the maximum-degree polynomial used by MVP is at least d, then the MVP score will be consistent in the large sample limit.

Proof

Let \(\varvec{p}\) be the approximating polynomial(s) from the Weierstrass Approximation Theorem [11]. Assuming the least squares estimate(s) contain the same polynomial degrees as \(\varvec{p}\), the least squares estimate(s) will converge to \(\varvec{p}\) as the number of samples \(n \rightarrow \infty \) [1]. Therefore, by the Weierstrass Approximation Theorem, the least squares estimate(s) will converge to the true data generating function(s). Accordingly, the log likelihood term of MVP will be maximal for any model that has either been correctly specified or overspecified, where by overspecified we mean a model containing all the true parameters and more.

However, for any overspecified model, the parameter penalty will necessarily be larger, and hence the MVP score will be lower for the overspecified model than for the correctly specified one.

Additionally, in the case of an underspecified model, note that the parameter penalty term is of order \(O(\log n)\) while the log likelihood term is of order \(O(n)\). This means that in the large sample limit, when comparing an underspecified model to any model containing the correctly specified model, the MVP score will be lower for the underspecified model, since its log likelihood is not maximal while that of the other is.
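
As a sketch of this comparison, write the score in the generic BIC form used throughout, a log likelihood term minus a degrees-of-freedom penalty scaled by \(\log n\) (the exact penalty constant does not affect the argument). For an underspecified model \(M_{0}\) and a model \(M_{1}\) containing the correctly specified model,

$$\begin{aligned} \mathcal {S}(M_{1}) - \mathcal {S}(M_{0}) = \underbrace{\big [ \ell _{1}(n) - \ell _{0}(n) \big ]}_{O(n)} - \underbrace{\tfrac{1}{2} \log n \, (df_{1} - df_{0})}_{O(\log n)} \rightarrow \infty \quad \text {as } n \rightarrow \infty , \end{aligned}$$

so the likelihood gap dominates the penalty gap in the large sample limit.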

Therefore, the MVP score is consistent. \(\square \)

Proposition 3

The approximated values from least squares regression sum to one.

Proof

Let the terms in the below equation be defined according to Sect. 4.3.

$$\begin{aligned} \sum _{h=1}^{d} \varvec{X}_p \hat{\varvec{\beta }}_{h}&= \varvec{X}_p (\varvec{X}_p^{T} \varvec{X}_p)^{-1} \varvec{X}_p^{T} \sum _{h=1}^{d} \varvec{1}_{\{\varvec{y}_p = h\}} \\&= \varvec{X}_p (\varvec{X}_p^{T} \varvec{X}_p)^{-1} \varvec{X}_p^{T} \varvec{1}_p \\&= \varvec{1}_p \end{aligned}$$

For the last step, we use that \(\varvec{1}_p\) is in the column space of \({\varvec{X}}_{p}\) and is thus projected to itself. \(\square \)
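
As an illustrative numerical check of this identity (not part of the proof), the following Python sketch regresses each category indicator on a design matrix containing an intercept column, so that the all-ones vector lies in its column space; the names X_p, y_p, and d are placeholders mirroring the notation of Sect. 4.3.

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3                                   # sample size and number of categories (placeholders)

# Design matrix with an intercept column, so the all-ones vector is in its column space.
X_p = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y_p = rng.integers(1, d + 1, size=n)            # discrete child taking values 1, ..., d

# Least squares fit of each category indicator on X_p, then sum the fitted values.
fitted_sum = np.zeros(n)
for h in range(1, d + 1):
    indicator = (y_p == h).astype(float)        # 1_{y_p = h}
    beta_h, *_ = np.linalg.lstsq(X_p, indicator, rcond=None)
    fitted_sum += X_p @ beta_h

print(np.allclose(fitted_sum, 1.0))             # True: the fitted values sum to one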

Proposition 4

If the approximating polynomial from the Weierstrass Approximation Theorem for the data generating function is a polynomial of degree d, and the maximum-degree polynomial used by MVP is at least d, then the least squares approximations for probability mass functions will be strictly nonnegative in the large sample limit.

Proof

Let f be a generating component of a conditional probability mass function. Since the Weierstrass Approximation Theorem is satisfied by the assumptions of MVP, for every \(\epsilon > 0\) there must exist a polynomial p such that for all \(x \in [a,b]\), we have \(|f(x) - p(x)| < \epsilon \) [11].

For \(x \in [a,b]\) where \(p(x) \ge f(x)\), p(x) is trivially nonnegative since \(f(x) > 0\).

For \(x \in [a,b]\) where \(p(x) < f(x)\), let \(m = f(x)\) and choose \(\epsilon = \frac{m}{2}\). Then,

$$\begin{aligned} |f(x) - p(x)|&< \epsilon \\ f(x) - p(x)&< \epsilon \\ p(x)&> f(x) - \epsilon \\ p(x)&> m - \frac{m}{2} \\ p(x)&> 0 \end{aligned}$$

since \(m > 0\).

Assuming the least squares estimate(s) contain the same polynomial degrees as p, the least squares estimate(s) will converge to p as the number of samples \(n \rightarrow \infty \) [1]. Thus, as the number of samples \(n \rightarrow \infty \), the least squares approximations are strictly nonnegative. \(\square \)

Appendix B

In this appendix, we detail the parameters used to simulate the data. Each parameter is followed by the values we used in simulation and a short description. We split the parameters into three groups: general parameters used across all simulations, parameters specific to the linear simulation, and parameters specific to the nonlinear simulation. Illustrative sketches of how these parameters might be used appear after the corresponding lists below.

1.1 General Parameters

  • numRuns: 10 - number of runs

  • numMeasures: 100, 500 - number of measured variables

  • avgDegree: 2, 4 - average degree of graph

  • sampleSize: 200, 1000 - sample size

  • minCategories: 2 - minimum number of categories

  • maxCategories: 5 - maximum number of categories

  • percentDiscrete: 50 - percentage of discrete variables (0 - 100) for mixed data

  • differentGraphs: true - true if a different graph should be used for each run

  • maxDegree: 5 - maximum degree of the graph

  • maxIndegree: 5 - maximum indegree of graph

  • maxOutdegree: 5 - maximum outdegree of graph

  • coefSymmetric: true - true if negative coefficient values should be considered
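
The general parameters above define a grid of simulation conditions: two settings each for numMeasures, avgDegree, and sampleSize give eight conditions, each run ten times, possibly with a different random graph per run. A minimal sketch of how such a grid might be enumerated follows; the variable names simply mirror the list above, and this is an illustration rather than the Tetrad configuration format.

from itertools import product

num_runs = 10
num_measures = [100, 500]
avg_degree = [2, 4]
sample_sizes = [200, 1000]

# Eight conditions, ten runs each; differentGraphs = true means a new random DAG per run.
for p, deg, n in product(num_measures, avg_degree, sample_sizes):
    for run in range(num_runs):
        # Generate a random DAG with p measured variables and average degree deg,
        # then simulate n samples of mixed data from it.
        pass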

1.2 Linear Parameters

  • varLow: 1 - low end of variance range

  • varHigh: 3 - high end of variance range

  • coefLow: 0.05 - low end of coefficient range

  • coefHigh: 1.5 - high end of coefficient range

  • meanLow: -1 - low end of mean range

  • meanHigh: 1 - high end of mean range
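
A minimal sketch of how a continuous child could be generated under the linear ranges above: coefficients drawn from [coefLow, coefHigh] and sign-flipped at random when coefSymmetric is true, an intercept drawn from [meanLow, meanHigh], and Gaussian noise with variance drawn from [varLow, varHigh]. The function below is illustrative only and ignores discrete parents; it is not the simulator used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def simulate_linear_child(cont_parents, coef_low=0.05, coef_high=1.5,
                          var_low=1.0, var_high=3.0,
                          mean_low=-1.0, mean_high=1.0, coef_symmetric=True):
    # cont_parents: (n, k) array of continuous parent values (k may be zero).
    n, k = cont_parents.shape
    coefs = rng.uniform(coef_low, coef_high, size=k)
    if coef_symmetric:
        coefs *= rng.choice([-1.0, 1.0], size=k)     # allow negative coefficients
    intercept = rng.uniform(mean_low, mean_high)
    variance = rng.uniform(var_low, var_high)
    noise = rng.normal(0.0, np.sqrt(variance), size=n)
    return intercept + cont_parents @ coefs + noise

# Example: a child with two continuous parents and 1000 samples.
parents = rng.normal(size=(1000, 2))
child = simulate_linear_child(parents)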

1.3 Nonlinear Parameters

  • dirichlet: 0.5 - alpha parameter for Dirichlet to draw multinomials

  • interceptLow: 1 - low end of intercept range

  • interceptHigh: 2 - high end of intercept range

  • linearLow: 1.0 - low end of linear coefficient range

  • linearHigh: 2.0 - high end of linear coefficient range

  • quadraticLow: 0.5 - low end of quadratic coefficient range

  • quadraticHigh: 1.0 - high end of quadratic coefficient range

  • cubicLow: 0.2 - low end of cubic coefficient range

  • cubicHigh: 0.3 - high end of cubic coefficient range

  • varLow: 0.5 - low end of variance range

  • varHigh: 0.5 - high end of variance range
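
Analogously, a minimal sketch of the nonlinear case: a continuous child responds to a continuous parent through a polynomial whose intercept, linear, quadratic, and cubic coefficients are drawn from the ranges above, with Gaussian noise of variance 0.5, while the conditional distribution of a discrete child for a given parent configuration could be drawn from a symmetric Dirichlet with the listed alpha. The function names and the single-parent simplification are ours for illustration; this is not the paper's simulator.

import numpy as np

rng = np.random.default_rng(0)

def simulate_polynomial_child(parent, var=0.5):
    # Draw polynomial coefficients from the nonlinear ranges listed above.
    b0 = rng.uniform(1.0, 2.0)      # intercept
    b1 = rng.uniform(1.0, 2.0)      # linear coefficient
    b2 = rng.uniform(0.5, 1.0)      # quadratic coefficient
    b3 = rng.uniform(0.2, 0.3)      # cubic coefficient
    noise = rng.normal(0.0, np.sqrt(var), size=parent.shape)
    return b0 + b1 * parent + b2 * parent**2 + b3 * parent**3 + noise

def draw_conditional_multinomial(n_categories, alpha=0.5):
    # One conditional distribution per discrete parent configuration, Dirichlet(alpha).
    return rng.dirichlet(np.full(n_categories, alpha))

x = rng.normal(size=1000)
y = simulate_polynomial_child(x)
pmf = draw_conditional_multinomial(n_categories=4)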

Cite this article

Andrews, B., Ramsey, J. & Cooper, G.F. Scoring Bayesian networks of mixed variables. Int J Data Sci Anal 6, 3–18 (2018). https://doi.org/10.1007/s41060-017-0085-7
