Penalized estimation of directed acyclic graphs from discrete data

Published in: Statistics and Computing

Abstract

Bayesian networks, with structure given by a directed acyclic graph (DAG), are a popular class of graphical models. However, learning Bayesian networks from discrete or categorical data is particularly challenging, due to the large parameter space and the difficulty in searching for a sparse structure. In this article, we develop a maximum penalized likelihood method to tackle this problem. Instead of the commonly used multinomial distribution, we model the conditional distribution of a node given its parents by multi-logit regression, in which an edge is parameterized by a set of coefficient vectors with dummy variables encoding the levels of a node. To obtain a sparse DAG, a group norm penalty is employed, and a blockwise coordinate descent algorithm is developed to maximize the penalized likelihood subject to the acyclicity constraint of a DAG. When interventional data are available, our method constructs a causal network, in which a directed edge represents a causal relation. We apply our method to various simulated and real data sets. The results show that our method is very competitive, compared to many existing methods, in DAG estimation from both interventional and high-dimensional observational data.
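As an illustration of the node-conditional model described above, the sketch below writes down the group-penalized multi-logit objective for a single node, treating all other variables as candidate parents. It is a minimal sketch with hypothetical names (one_hot, node_penalized_nll, levels) and is not the authors' implementation; the full method maximizes such terms jointly over all nodes subject to the acyclicity constraint.

```python
import numpy as np

def one_hot(x, r):
    """Dummy-encode an integer vector x with levels 0, ..., r-1."""
    Z = np.zeros((len(x), r))
    Z[np.arange(len(x)), x] = 1.0
    return Z

def node_penalized_nll(X, j, levels, beta, intercept, lam):
    """Group-penalized negative log-likelihood of node j under a multi-logit model.

    X         : (n, p) integer data matrix, X[:, i] in {0, ..., levels[i]-1}
    beta      : dict mapping a candidate parent i to a (levels[i], levels[j])
                coefficient matrix -- the group encoding the edge i -> j
    intercept : array of shape (levels[j],)
    lam       : group-penalty tuning parameter
    """
    n = X.shape[0]
    eta = np.tile(intercept, (n, 1))                    # linear predictors
    for i, B in beta.items():
        eta += one_hot(X[:, i], levels[i]) @ B          # contribution of parent i
    eta -= eta.max(axis=1, keepdims=True)               # numerical stability
    log_prob = eta - np.log(np.exp(eta).sum(axis=1, keepdims=True))
    nll = -log_prob[np.arange(n), X[:, j]].sum()
    penalty = lam * sum(np.linalg.norm(B) for B in beta.values())  # ||beta_{j.i}||_2
    return nll + penalty
```

Minimizing such objectives by group-wise (blockwise) coordinate descent over the parent groups, while rejecting updates that would create a directed cycle, mirrors the procedure summarized in the abstract.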

Acknowledgements

This work was supported by NSF Grant IIS-1546098 (to Q.Z.).

Author information

Correspondence to Qing Zhou.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 248 KB)

Appendix: Asymptotic theory

In this appendix, we establish asymptotic theory for the DAG estimator \(\hat{\varvec{\beta }}_{\lambda }\) (12) assuming that p is fixed and \(n\rightarrow \infty \). By rearranging and relabeling individual components, we rewrite \(\varvec{\beta }\) as \(\varvec{\phi }=(\varvec{\phi }_{(1)}, \varvec{\phi }_{(2)})\), where \(\varvec{\phi }_{(1)} = \text {vec}( \varvec{\beta }_{1\cdot 1},\ldots , \varvec{\beta }_{1\cdot p}, \ldots , \varvec{\beta }_{p\cdot 1},\ldots , \varvec{\beta }_{p\cdot p})\) is the parameter vector of interest and \(\varvec{\phi }_{(2)} = \text {vec}(\varvec{\beta }_{1\cdot 0}, \ldots ,\varvec{\beta }_{p\cdot 0})\) denotes the vector of intercepts. Hereafter, we denote by \(\phi _j\) the jth group of \(\varvec{\phi }\), such that \(\phi _1 = \varvec{\beta }_{1\cdot 1}\), \(\phi _2 = \varvec{\beta }_{1\cdot 2}, \ldots , \phi _{p^2}=\varvec{\beta }_{p\cdot p}\), and so on. We say \(\varvec{\phi }\) is acyclic if the graph \(\mathcal {G}_{\varvec{\phi }}\) induced by \(\varvec{\phi }\) (or the corresponding \(\varvec{\beta }\)) is acyclic.
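The acyclicity of \(\mathcal {G}_{\varvec{\phi }}\) is determined purely by which edge groups are nonzero. As a small illustration (a sketch with an assumed dictionary layout for the groups, not the paper's code), the following builds the induced adjacency matrix and tests acyclicity by repeatedly peeling off nodes without parents (Kahn's algorithm):

```python
import numpy as np

def induced_graph(beta_groups, p, tol=0.0):
    """Adjacency of G_phi: edge i -> j is present iff the group beta_{j.i}
    is nonzero.  beta_groups: dict with keys (j, i) and array values."""
    A = np.zeros((p, p), dtype=bool)
    for (j, i), B in beta_groups.items():
        A[i, j] = np.linalg.norm(B) > tol
    return A

def is_acyclic(A):
    """Repeatedly remove nodes with no incoming edge among the remaining ones;
    a cycle exists iff at some point no such node can be found."""
    remaining = set(range(A.shape[0]))
    while remaining:
        idx = list(remaining)
        sources = [v for v in idx if not A[idx, v].any()]
        if not sources:
            return False
        remaining.difference_update(sources)
    return True
```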

Define \(\varvec{\phi }_{[k]}\) (\(k\in \{1,\ldots ,p\}\)) to be the parameter vector obtained from \(\varvec{\phi }\) by setting \(\varvec{\beta }_{k\cdot i} = \mathbf {0}\) for \(i=1, \ldots , p\). In other words, the DAG \(\mathcal {G}_{\varvec{\phi }_{[k]}}\) is obtained by deleting all edges pointing to the kth node in \(\mathcal {G}_{\varvec{\phi }}\); see (10). We assume the data set \(\mathcal {X}\) consists of \((p+1)\) blocks, denoted by \(\mathcal {X}^j\) of size \(n_j \times p\), \(j=1,\ldots ,p+1\). The node \(X_j\) is experimentally fixed in \(\mathcal {X}^j\) for the first p blocks, while the last block contains purely observational data. Let \(\mathcal {I}_j\) be the set of row indices of \(\mathcal {X}^j\). As demonstrated by (2), we can model interventional data in the kth block of the data matrix \(\mathcal {X}^k\) as i.i.d. observations from a joint distribution factorized according to \(\mathcal {G}_{\varvec{\phi }_{[k]}}\). Denote the corresponding probability mass function by \(p(\mathbf {x}|\varvec{\phi }_{[k]})\), where \(\mathbf {x}=(x_1,\ldots ,x_p)\) and \(x_j \in \{ 1, \ldots , r_j \}\) for \(j=1,\ldots ,p\). To simplify our notation, denote the parameter for the \((p+1)\)th block by \(\varvec{\phi }_{[p+1]} = \varvec{\phi }\). Then the log-likelihood of \(\mathcal {X}\) is

$$\begin{aligned} L(\varvec{\phi }) = \sum _{k=1}^{p+1} L_k(\varvec{\phi }_{[k]}) = \sum _{k=1}^{p+1} \log p(\mathcal {X}^k \mid \varvec{\phi }_{[k]}), \end{aligned}$$
(23)

where \(\log p(\mathcal {X}^k | \varvec{\phi }_{[k]})=\sum _{h \in \mathcal {I}_k} \log (p(\mathcal {X}_{h\cdot }| \varvec{\phi }_{[k]}))\) and \(\mathcal {X}_{h\cdot } =(\mathcal {X}_{h1},\ldots ,\mathcal {X}_{hp})\). The penalized log-likelihood function with a tuning parameter \(\lambda _n>0\) is

$$\begin{aligned} R(\varvec{\phi }) &= L(\varvec{\phi })-\lambda _n\sum _{j=1}^{p^2}||\phi _j||_2 \\ &= \sum _{k=1}^{p+1}L_k(\varvec{\phi }_{[k]})-\lambda _n\sum _{j=1}^{p^2}||\phi _j||_2, \end{aligned}$$
(24)

where the component group \(\phi _j\,(j=1,\ldots ,p^2)\) represents the influence of one variable on another. Let \(\Omega = \{\varvec{\phi }: \mathcal {G}_{\varvec{\phi }} \text { is a DAG}\}\) be the parameter space. A penalized estimator \(\hat{\varvec{\phi }}\) is obtained by maximizing \(R(\varvec{\phi })\) in \(\Omega \).
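A minimal sketch of (23)–(24), assuming a per-node conditional log-likelihood routine node_loglik (e.g., the multi-logit model) and a dictionary layout for \(\varvec{\phi }\); the names, layout, and zero-based block indexing are hypothetical, not the authors' implementation:

```python
import numpy as np

def penalized_loglik(blocks, phi, lam, node_loglik):
    """Penalized log-likelihood R(phi) over p interventional blocks plus one
    observational block.

    blocks      : list of p+1 data matrices; node k is experimentally fixed
                  in blocks[k] for k = 0, ..., p-1, and blocks[p] is observational
    phi         : dict with keys (j, i) for the edge group beta_{j.i} and
                  keys (j, None) for the intercept beta_{j.0}
    node_loglik : node_loglik(X, j, phi) -> conditional log-likelihood of node j
    """
    p = blocks[0].shape[1]
    total = 0.0
    for k, X in enumerate(blocks):
        # phi_[k]: zero all edge groups pointing into the intervened node k
        phi_k = {key: (np.zeros_like(val)
                       if k < p and key[0] == k and key[1] is not None
                       else val)
                 for key, val in phi.items()}
        total += sum(node_loglik(X, j, phi_k) for j in range(p))
    penalty = lam * sum(np.linalg.norm(val)
                        for (j, i), val in phi.items() if i is not None)
    return total - penalty
```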

Though interventional data help distinguish equivalent DAGs, the following notion of natural parameters is needed to completely establish identifiability of DAGs for the case where each variable has interventional data. We say that i is an ancestor of j in a DAG \(\mathcal {G}\) if there exists at least one directed path from i to j. Denote the set of ancestors of j by \(\text {an}(j)\).

Definition 1

(Natural parameters) We say that \(\varvec{\phi } \in \Omega \) is natural if \(i \in \text {an}(j) \text { in } \mathcal {G}_{\varvec{\phi }}\) implies that j is not independent of i under the joint distribution given by \(\varvec{\phi }_{[i]}\) for all \(i,j=1,\ldots ,p\).

For a causal DAG, a natural parameter implies that the effects along multiple causal paths connecting the same pair of nodes do not cancel. This is a reasonable assumption for many real-world problems and is much weaker than the faithfulness assumption, under which all conditional independence relations can be read off from d-separations in the DAG. If nodes i and j are independent under \(\varvec{\phi }_{[i]}\), then by faithfulness the nodes i and j must be d-separated by the empty set, and thus \(i \notin \text {an}(j)\) in \(\mathcal {G}_{\varvec{\phi }_{[i]}}\). This implies that \(i \notin \text {an}(j)\) in \(\mathcal {G}_{\varvec{\phi }}\) as well, by the construction of \(\mathcal {G}_{\varvec{\phi }_{[i]}}\). Hence the faithfulness assumption implies the natural parameter assumption.
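Since the natural-parameter condition is stated per ordered pair \((i,j)\) with \(i \in \text {an}(j)\), it can be convenient to enumerate the ancestor sets explicitly. A small sketch (assuming a boolean adjacency matrix A with A[i][j] true for an edge \(i \rightarrow j\); a hypothetical helper, not from the paper):

```python
def ancestors(A, j):
    """Return an(j): all nodes with a directed path to j, by reverse DFS."""
    p = len(A)
    seen, stack = set(), [j]
    while stack:
        v = stack.pop()
        for i in range(p):
            if A[i][v] and i not in seen:
                seen.add(i)
                stack.append(i)
    return seen
```

For a natural \(\varvec{\phi }\), every node \(i\) returned by this search for node \(j\) must remain dependent on \(X_j\) under the intervention distribution given by \(\varvec{\phi }_{[i]}\).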

To establish asymptotic properties of our penalized likelihood estimator, we make the following assumptions:

  1. (A1)

    The true parameter \(\varvec{\phi }^*\) is natural and an interior point of \(\Omega \).

  2. (A2)

The parameter \(\varvec{\theta }_j\) of the conditional distribution \([X_j | \Uppi _j^{\mathcal {G}}; \varvec{\theta }_j]\) is identifiable for each \(j=1,\ldots ,p\). The log-likelihood function \(\ell _j(\varvec{\theta }_j) = \log p({x}_j|\Uppi _j^{\mathcal {G}}; \varvec{\theta }_j)\) is strictly concave and three times continuously differentiable at any interior point.

Recall that the kth block of our data, \(\mathcal {X}^k\), can be regarded as an i.i.d. sample of size \(n_k\) from the distribution \(p(\mathbf {x}|\varvec{\phi }_{[k]}^*)\) for all k, while we define \(\varvec{\phi }_{[p+1]}^*=\varvec{\phi }^*\) for the last block of observational data.

Theorem 1

Assume (A1) and (A2). If \(p(\mathbf {x}|\varvec{\phi }_{[k]})=p(\mathbf {x}|\varvec{\phi }_{[k]}^*)\) for all possible \(\mathbf {x}\) and all \(k=1,\ldots ,p\), then \(\varvec{\phi }=\varvec{\phi }^*\). Furthermore, if \(n_k\gg \sqrt{n}\) for all \(k=1,\ldots ,p\), then for any \(\varvec{\phi } \ne \varvec{\phi }^*\),

$$\begin{aligned} P(L(\varvec{\phi }^*)>L(\varvec{\phi })) \rightarrow 1 \quad \text { as } n \rightarrow \infty . \end{aligned}$$
(25)

Theorem 2

Assume (A1) and (A2). If \(\lambda _n/\sqrt{n}\rightarrow 0\) and \(n_k\gg \sqrt{n}\) for all \(k=1,\ldots ,p\), then there exists a global maximizer \(\hat{\varvec{\phi }}\) of \(R(\varvec{\phi })\) such that \(||\hat{\varvec{\phi }}-\varvec{\phi }^*||_2=O_p(n^{-1/2})\).

Proofs of the two theorems are provided in the Supplementary Material. Theorem 1 confirms that the causal DAG model is identifiable from interventional data under a natural parameter. Theorem 2 implies that there is a \(\sqrt{n}\)-consistent global maximizer of \(R(\varvec{\phi })\) with the group norm penalty. Note that Assumption (A2) does not specify a particular model for the conditional distribution \([X_j | \Uppi _j^{\mathcal {G}}]\), so these theoretical results apply to a large class of DAG models for discrete data. In particular, the multi-logit regression model (4) satisfies (A2).

Remark 2

The assumption on the sample size of interventional data, \(n_k\gg \sqrt{n}\), imposes a lower bound on how fast the fraction \(\alpha _k=n_k/n\) may approach zero for \(k=1,\ldots ,p\), namely \(\alpha _k \gg n^{-1/2}\). Although this allows the observational data to dominate as \(\alpha _k\rightarrow 0\), the fractions of interventional data must exceed the typical order \(O_p(n^{-1/2})\) of statistical errors so that (25) holds and the true causal DAG parameter \(\varvec{\phi }^*\) is identifiable. This guarantees that the global maximizer \(\hat{\varvec{\phi }}\) lies in a neighborhood of \(\varvec{\phi }^*\) with high probability. Once in this neighborhood, the convergence rate of \(\hat{\varvec{\phi }}\) depends on the total sample size n, counting both interventional and observational data. Therefore, increasing the amount of observational data leads to a more accurate estimate \(\hat{\varvec{\phi }}\) as long as \(\alpha _k\gg n^{-1/2}\) for \(k=1,\ldots ,p\).
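As a concrete (hypothetical) allocation satisfying both requirements, one may take \(\alpha _k = n^{-1/3}\) for every node, so that

$$\begin{aligned} n_k = n\,\alpha _k = n^{2/3}, \qquad \frac{n_k}{\sqrt{n}} = n^{1/6} \rightarrow \infty , \qquad \alpha _k = n^{-1/3} \rightarrow 0. \end{aligned}$$

For instance, with \(n=10^6\) each interventional block has \(n_k = 10^4\) observations, well above \(\sqrt{n}=10^3\), while the observational block still accounts for most of the data when p is small.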

Remark 3

It would be interesting to generalize the above asymptotic results to the case where \(p=p_n\) grows with the sample size n, say by developing nonasymptotic bounds on the \(\ell _2\) estimation error \(\Vert \hat{\varvec{\phi }}-\varvec{\phi }^*\Vert _2\). However, in order to estimate the causal network consistently, sufficient interventional data are needed for each node, i.e., every \(n_k\) must approach infinity, and thus \(p/ n \rightarrow 0\) as \(n\rightarrow \infty \). This limits us to the low-dimensional setting with \(p<n\). Suppose instead that we have a large network with \(p\gg n\). One may first apply a regularization method to observational data to screen out independent nodes and to partition the network into small subgraphs that are disconnected from one another. Then, for each small subgraph, we can afford to generate enough interventional data for every node and apply the method in this paper to infer the causal structure. Our asymptotic theory provides useful guidance for the analysis in this second step.

For purely observational data, the theory becomes more complicated due to the existence of equivalent DAGs and parameterizations. It is left as future work to establish the consistency of a global maximizer for high-dimensional observational data.

Cite this article

Gu, J., Fu, F. & Zhou, Q. Penalized estimation of directed acyclic graphs from discrete data. Stat Comput 29, 161–176 (2019). https://doi.org/10.1007/s11222-018-9801-y
