Statistics and Computing, Volume 29, Issue 1, pp 161–176

Penalized estimation of directed acyclic graphs from discrete data

  • Jiaying Gu
  • Fei Fu
  • Qing Zhou


Bayesian networks, with structure given by a directed acyclic graph (DAG), are a popular class of graphical models. However, learning Bayesian networks from discrete or categorical data is particularly challenging, due to the large parameter space and the difficulty in searching for a sparse structure. In this article, we develop a maximum penalized likelihood method to tackle this problem. Instead of the commonly used multinomial distribution, we model the conditional distribution of a node given its parents by multi-logit regression, in which an edge is parameterized by a set of coefficient vectors with dummy variables encoding the levels of a node. To obtain a sparse DAG, a group norm penalty is employed, and a blockwise coordinate descent algorithm is developed to maximize the penalized likelihood subject to the acyclicity constraint of a DAG. When interventional data are available, our method constructs a causal network, in which a directed edge represents a causal relation. We apply our method to various simulated and real data sets. The results show that our method is very competitive, compared to many existing methods, in DAG estimation from both interventional and high-dimensional observational data.
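The model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a softmax form of multi-logit regression, level 0 of each parent as the reference level for the dummy variables, and an unweighted Euclidean norm over all coefficients of an edge as the group norm penalty. The function names, the toy parents "A" and "B", and the value of `lam` are hypothetical.

```python
import math

def dummy_encode(level, n_levels):
    """Encode a level in {0, ..., n_levels - 1} as n_levels - 1 dummy variables.
    Level 0 is taken as the reference level (an assumption of this sketch)."""
    return [1.0 if level == k else 0.0 for k in range(1, n_levels)]

def multilogit_probs(intercepts, edge_coefs, parent_levels, parent_n_levels):
    """Conditional distribution of a child node given its parents.

    intercepts      : one intercept per child level
    edge_coefs      : parent name -> list (one entry per child level) of
                      coefficient vectors, each of length n_levels(parent) - 1
    parent_levels   : parent name -> observed level of that parent
    parent_n_levels : parent name -> number of levels of that parent
    """
    scores = []
    for child_level, b0 in enumerate(intercepts):
        s = b0
        for parent, level in parent_levels.items():
            x = dummy_encode(level, parent_n_levels[parent])
            beta = edge_coefs[parent][child_level]
            s += sum(b * xi for b, xi in zip(beta, x))
        scores.append(s)
    # Softmax with the maximum subtracted for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def group_norm_penalty(edge_coefs, lam):
    """lam times the sum over candidate edges of the Euclidean norm of all
    coefficients attached to that edge (unweighted; an assumption)."""
    total = 0.0
    for vectors in edge_coefs.values():
        total += math.sqrt(sum(b * b for beta in vectors for b in beta))
    return lam * total

# Toy example: a binary child with candidate parents A (3 levels) and B (2 levels).
n_levels = {"A": 3, "B": 2}
edge_coefs = {
    "A": [[0.0, 0.0], [1.0, -1.0]],  # nonzero group: edge A -> child is present
    "B": [[0.0], [0.0]],             # all-zero group: no edge B -> child
}
intercepts = [0.0, 0.0]

probs_ref = multilogit_probs(intercepts, edge_coefs, {"A": 0, "B": 1}, n_levels)
probs_a1 = multilogit_probs(intercepts, edge_coefs, {"A": 1, "B": 1}, n_levels)
penalty = group_norm_penalty(edge_coefs, lam=0.5)
active_parents = [p for p, vecs in edge_coefs.items()
                  if any(b != 0.0 for beta in vecs for b in beta)]
```

Because the penalty acts on the whole set of coefficient vectors of an edge at once, an edge is removed from the graph only when its entire group is zero, which is how a group norm penalty produces sparsity at the level of edges rather than individual coefficients. The acyclicity constraint and the blockwise coordinate descent updates are not covered by this sketch.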


Keywords: Coordinate descent; Discrete Bayesian network; Multi-logit regression; Structure learning; Group norm penalty



This work was supported by NSF Grant IIS-1546098 (to Q.Z.).

Supplementary material

11222_2018_9801_MOESM1_ESM.pdf (249 kb)
Supplementary material 1 (pdf 248 KB)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Statistics, University of California, Los Angeles, USA
