Abstract
Directed acyclic graphs (DAGs) have been widely used to model causal relationships among variables in multivariate data. Often, however, covariates are available alongside these data and may influence the underlying causal network. Motivated by such data, in this paper we incorporate the covariates directly into the DAG to model the dependency relationships among the nodal variables. Specifically, the causal strengths are assumed to be linear functions of the covariates, which enhances the interpretability and flexibility of the model. We fit the model in the \(l_1\)-penalized maximum likelihood framework and employ a coordinate-descent-based algorithm to solve the resulting optimization problem. The consistency of the estimator is also established under the regime where the order of the nodal variables is known. Finally, we evaluate the performance of the proposed method through a series of simulations and a lung cancer data example.
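The fitting procedure summarized above can be sketched in a few lines of code. The sketch below is a simplified illustration, not the authors' implementation: for a single node j (with the ordering known), the design matrix stacks the Kronecker products \(x^i\otimes y^i_{\setminus j}\) so that each edge strength is linear in the covariates, and the resulting lasso problem is solved by coordinate descent. All function names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the closed-form coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(W, r, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||r - W b||^2 + lam ||b||_1."""
    n, p = W.shape
    b = np.zeros(p)
    col_sq = (W ** 2).sum(axis=0) / n
    resid = r - W @ b
    for _ in range(n_sweeps):
        for l in range(p):
            if col_sq[l] == 0:
                continue
            resid += W[:, l] * b[l]                  # remove coordinate l
            rho = W[:, l] @ resid / n
            b[l] = soft_threshold(rho, lam) / col_sq[l]
            resid -= W[:, l] * b[l]                  # restore with the update
    return b

def fit_node(Y, X, j, lam):
    """Regress node j on the other nodes, each edge strength being a
    linear function of the covariates: design rows are x^i kron y^i_{-j}."""
    n = Y.shape[0]
    Y_mj = np.delete(Y, j, axis=1)
    W = np.einsum('iq,ip->iqp', X, Y_mj).reshape(n, -1)
    return lasso_cd(W, Y[:, j], lam)

# Toy data: node 1 depends on node 0 with strength 1 + 0.5 * covariate.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
Y0 = rng.normal(size=n)
Y1 = (1.0 + 0.5 * X[:, 1]) * Y0 + 0.1 * rng.normal(size=n)
b_hat = fit_node(np.column_stack([Y0, Y1]), X, 1, lam=0.05)
```

Here `b_hat` has one entry per covariate (intercept included), estimating the baseline strength and the covariate effect of the single edge; with `lam = 0.05` both estimates are shrunk slightly toward zero.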
References
Aragam B, Zhou Q (2015) Concave penalized estimation of sparse Gaussian Bayesian networks. J Mach Learn Res 16:2273–2328
Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 8:816–824
Cai T, Liu W, Luo X (2011) A constrained \(l_1\) minimization approach to sparse precision matrix estimation. J Am Stat Assoc 106:594–607
Cai T, Liu W, Xie J (2013) Covariate adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100:139–156
Chen M, Zhao R, Zhao H, Zhou H (2016) Asymptotically normal and efficient estimation of covariate-adjusted Gaussian graphical model. J Am Stat Assoc 111:394–406
Cheng J, Levina E, Wang P, Zhu J (2014) A sparse Ising model with covariates. Biometrics 70:943–953
Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347
Drton M, Maathuis M (2017) Structure learning in graphical modeling. Annu Rev Stat Appl 4:3.1–3.29
Edwards D (2000) Introduction to graphical modelling. Springer, New York
Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1:302–332
Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432–441
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22
Fu F, Zhou Q (2013) Learning sparse causal Gaussian networks with experimental intervention: regularization and coordinate descent. J Am Stat Assoc 108:288–300
Gao B, Cui Y (2015) Learning directed acyclic graphical structures with genetical genomics data. Bioinformatics 31:3953–3960
Ha MJ, Sun W, Xie J (2016) PenPC: a two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics 72:146–155
Han SW, Chen G, Cheon MS, Zhong H (2016) Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. J Am Stat Assoc 111:1004–1019
Ising E (1925) Beitrag zur Theorie des Ferromagnetismus. Z Phys 31:253–258
Kalisch M, Bühlmann P (2007) Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Mach Learn Res 8:613–636
Lam W, Bacchus F (1994) Learning Bayesian belief networks: an approach based on the MDL principle. Comput Intel 10:269–293
Lam C, Fan J (2009) Sparsistency and rates of convergence in large covariance matrices estimation. Ann Stat 37:4254–4278
Lauritzen S (1996) Graphical models. Oxford University Press, Oxford
Leng C, Tang CY (2012) Sparse matrix graphical models. J Am Stat Assoc 107:1187–1200
Leung D, Drton M, Hara H (2016) Identifiability of directed Gaussian graphical models with one latent source. Electron J Stat 10:394–422
Liang X, Young W, Huang L, Raftery A, Yeung K (2017) Integration of multiple data sources for gene network inference using genetic perturbation data. https://doi.org/10.1101/158394
Lin J, Basu S, Banerjee M, Michailidis G (2016) Penalized maximum likelihood estimation of multi-layered Gaussian graphical models. J Mach Learn Res 17:1–51
Liu H, Chen X, Lafferty J, Wasserman L (2010) Graph-valued regression. In: Proceedings of Advances in Neural Information Processing Systems, vol 23
Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the lasso. Ann Stat 34:1436–1462
Ni Y, Stingo FC, Baladandayuthapani V (2017) Sparse multi-dimensional graphical models: a unified Bayesian framework. J Am Stat Assoc 112:779–793
Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge
Peng J, Wang P, Zhou N, Zhu J (2009) Partial correlation estimation by joint sparse regression models. J Am Stat Assoc 104:735–746
Peters J, Bühlmann P (2014) Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101:219–228
Ravikumar P, Wainwright MJ, Lafferty J (2010) High-dimensional Ising model selection using \(l_1\)-regularized logistic regression. Ann Stat 38:1287–1319
Ravikumar P, Raskutti G, Wainwright MJ (2011) High-dimensional covariance estimation by minimizing \(l_1\)-penalized log-determinant divergence. Electron J Stat 5:935–980
Rothman AJ, Bickel PJ, Levina E, Zhu J (2008) Sparse permutation invariant covariance estimation. Electron J Stat 2:494–515
Shojaie A, Michailidis G (2010) Penalized likelihood methods for estimation of sparse high dimensional directed acyclic graphs. Biometrika 97:519–538
Shojaie A, Jauhiainen A, Kallitsis M, Michailidis G (2014) Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles. PLoS ONE 9:e82392
Spirtes P, Glymour C, Scheines R (2000) Causation, prediction, and search. The MIT Press, Cambridge
van de Geer S, Bühlmann P (2013) \(l_0\)-penalized maximum likelihood for sparse directed acyclic graphs. Ann Stat 41:536–567
Wainwright MJ (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l_1\) constrained quadratic programming (lasso). IEEE Trans Inf Theory 55:2183–2202
Witten DM, Friedman JH, Simon N (2011) New insights and faster computations for the graphical lasso. J Comput Graph Stat 20:892–900
Wu T, Lange K (2008) Coordinate descent procedures for lasso penalized regression. Ann Appl Stat 2:224–244
Yin J, Li H (2011) A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann Appl Stat 5:2630–2650
Yuan M, Lin Y (2007) Model selection and estimation in the Gaussian graphical model. Biometrika 94:19–35
Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7:2541–2567
Zhou S (2014) Gemini: graph estimation with matrix variate normal instances. Ann Stat 42:532–562
Acknowledgements
We are very grateful to the Editor, the Associate Editor and the two referees for their careful work and helpful comments which led to an improved version of this paper.
This work was partially supported by National Natural Science Foundation of China under Grant Number 11571011.
Appendix
In this section, we give the proof of Theorem 1. To simplify the notation, we drop the subscript j, which indexes the \(j\hbox {th}\) problem. Define the sample version of \(\varvec{I}^0\) as
Sketch of Proof We break the proof down into a sequence of lemmas. In what follows, we first prove the results under the assumption that \(\mathbf{A1 }\) and \(\mathbf{A2 }\) are satisfied by \(\varvec{I}^n\), the sample version of \(\varvec{I}^0\); this is presented in Lemma 1. We then prove that if \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^0\), their counterparts hold for \(\varvec{I}^n\) with high probability; this is established in Lemma 2. The proof of Lemma 1 is based on a technique called the primal-dual witness method (Wainwright 2009). It consists of constructing a pair \((\check{\varvec{\beta }},\check{\varvec{z}})\) following a specific sequence of steps. Roughly, \(\check{\varvec{\beta }}\) is constructed to be zero on the non-support \(\varvec{S}^c\) of the true \(\varvec{\beta }^0\), while its elements on the true support \(\varvec{S}\) are constructed according to (7) with the solution restricted to \(\varvec{S}\). The vector \(\check{\varvec{z}}\) is built so that, with high probability, \((\check{\varvec{\beta }},\check{\varvec{z}})\) together satisfy the optimality conditions of problem (7), as shown in Proposition 1 below. In this way, if the construction succeeds, \(\check{\varvec{\beta }}\) is the optimal solution to (7). Note that \(\check{\varvec{\beta }}_{\varvec{S}^c}=\varvec{0}\), so the sign consistency can be obtained by evaluating the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\). Keep in mind that the procedure is a proof technique, not a practical algorithm.
Lemma 1
If \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^n\),
and
then with probability larger than \(1-2\,{\exp }(-C\lambda ^2n)\), the results of Theorem 1 hold.
Before proving Lemma 1, we rewrite the Lemma 1 of Wainwright (2009) using our notation as the following Proposition 1.
Proposition 1
Denote \(\varvec{w}=(\varvec{x}^i\otimes \varvec{y}^i_{\setminus j})_{i=1,\ldots ,n}\) as the design matrix, then,
(a) \(\hat{\varvec{\beta }}\) is the optimal solution to (7) if and only if there exists a subgradient vector \(\hat{\varvec{z}}\in \partial \Vert \hat{\varvec{\beta }}\Vert _1\) such that
$$\begin{aligned} \frac{1}{n}\varvec{w}'\varvec{w}(\hat{\varvec{\beta }}-\varvec{\beta }^0)-\frac{1}{n}\varvec{w}'\varvec{\epsilon }+\lambda \hat{\varvec{z}}=0. \end{aligned}$$
(12)
(b) Assume that the subgradient vector satisfies the strict feasibility condition \(|\hat{z}_l|<1\) for all \(l\in \varvec{S}^c(\hat{\varvec{\beta }})\). Then any optimal solution \(\tilde{\varvec{\beta }}\) satisfies \(\tilde{\varvec{\beta }}_l=0\) for all \(l\in \varvec{S}^c(\hat{\varvec{\beta }})\), where \(\varvec{S}^c(\hat{\varvec{\beta }})\) denotes the non-support index set of \(\hat{\varvec{\beta }}\).
(c) Under the conditions of (b), if \(\varvec{w}'_{\varvec{S}(\hat{\varvec{\beta }})}\varvec{w}_{\varvec{S}(\hat{\varvec{\beta }})}\) is invertible, then \(\hat{\varvec{\beta }}\) is the unique optimal solution of the lasso problem (7).
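Condition (12) is the standard lasso stationarity (KKT) condition, and it can be checked numerically for any computed solution. The sketch below is a hypothetical illustration: it solves a small lasso problem by proximal gradient descent (ISTA, not the paper's coordinate descent) and verifies that the gradient of the smooth part equals \(-\lambda \hat{z}\) on the active set and is bounded by \(\lambda \) in absolute value elsewhere.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(W, y, lam, n_iter=5000):
    """Proximal gradient descent for (1/2n)||y - W b||^2 + lam ||b||_1."""
    n, p = W.shape
    step = n / np.linalg.norm(W, 2) ** 2   # 1/L with L = ||W||_2^2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = W.T @ (W @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(0)
n, p, lam = 200, 8, 0.1
W = rng.normal(size=(n, p))
beta0 = np.zeros(p)
beta0[:2] = [1.0, -2.0]
y = W @ beta0 + 0.3 * rng.normal(size=n)
b = ista(W, y, lam)

# Since y = W beta0 + eps, condition (12),
# (1/n)W'W(b - beta0) - (1/n)W'eps + lam*z = 0, reads (1/n)W'(W b - y) = -lam*z.
g = W.T @ (W @ b - y) / n
active = b != 0
assert np.all(np.abs(g[active] + lam * np.sign(b[active])) < 1e-3)  # z = sign(b) on support
assert np.all(np.abs(g) <= lam + 1e-6)                              # |z| <= 1 off support
```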
Proposition 1 combined with the condition \(\mathbf{A2 }\) will be used to show the uniqueness of the estimator. Now we start to prove Lemma 1.
Proof of Lemma 1
We use the primal-dual witness (PDW) method (Wainwright 2009) to prove the results, which consists of constructing a pair \((\check{\varvec{\beta }},\check{\varvec{z}})\) by the following four steps.
Step 1: Set \(\check{\varvec{\beta }}_{\varvec{S}^c}=\varvec{0}\) and construct \(\check{\varvec{\beta }}_{\varvec{S}}\) as the solution of the following restricted lasso problem,
By the sample version of condition \(\mathbf{A2 }\), we know that the solution to (13) is unique.
Step 2: Choose \(\check{\varvec{z}}_{\varvec{S}}\) as one of the subdifferential of the \(l_1\) norm at \(\check{\varvec{\beta }}_{\varvec{S}}\), i.e., \(\check{\varvec{z}}_{\varvec{S}}\in \partial \Vert \check{\varvec{\beta }}_{\varvec{S}}\Vert _1 \).
Step 3: Set \(\check{\varvec{z}}_{\varvec{S}^c}\) as the solution to problem (12) and prove the strict dual feasibility condition \(|\check{\varvec{z}}_{l}|<1\) for all \(l\in \varvec{S}^c\), which ensures the optimality.
Step 4: Prove the sign consistency by establishing the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}.\)
By Proposition 1, if the above four steps succeed, then \(\hat{\varvec{\beta }}=(\check{\varvec{\beta }}_{\varvec{S}},\varvec{0})\) is the unique solution to the lasso problem (7). Moreover, \(\hat{\varvec{\beta }}\) correctly identifies the support of the true \(\varvec{\beta }^0\) with the correct signs. For the PDW construction to succeed, we in fact only need to show that Steps 3 and 4 hold with high probability. Next, we prove them in turn.
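The four steps can be mirrored numerically on a toy problem. The sketch below is a hypothetical illustration, not part of the proof: it assumes the restricted solution keeps the signs of \(\varvec{\beta }^0_{\varvec{S}}\) (which the beta-min condition guarantees), so Step 1 admits a closed form, and then checks strict dual feasibility and the \(l_\infty \) error directly.

```python
import numpy as np

def pdw_construction(W, y, beta0_S, S, lam):
    """Mirror the four PDW steps (a proof device, not an estimator),
    assuming the restricted solution has the signs of beta0_S."""
    n, p = W.shape
    Sc = [l for l in range(p) if l not in S]
    WS, WSc = W[:, S], W[:, Sc]
    z_S = np.sign(beta0_S)                    # Step 2: subgradient on S
    G = WS.T @ WS / n
    # Step 1: restricted lasso in closed form given the signs
    b_S = np.linalg.solve(G, WS.T @ y / n - lam * z_S)
    # Step 3: solve the stationarity condition (12) for z on S^c
    resid = y - WS @ b_S
    z_Sc = WSc.T @ resid / (n * lam)
    strict = np.max(np.abs(z_Sc)) < 1.0       # strict dual feasibility
    linf = np.max(np.abs(b_S - beta0_S))      # Step 4: l_inf error on S
    return b_S, z_Sc, strict, linf

rng = np.random.default_rng(0)
n, p, S = 500, 10, [0, 1]
W = rng.normal(size=(n, p))                   # incoherent Gaussian design
beta0_S = np.array([1.5, -1.0])
y = W[:, S] @ beta0_S + 0.5 * rng.normal(size=n)
b_S, z_Sc, strict, linf = pdw_construction(W, y, beta0_S, S, lam=0.1)
```

With an incoherent design and strong signals, `strict` comes out `True` and `linf` is small, so by Proposition 1 the witness pair certifies the full lasso solution.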
We begin with verifying the strict dual feasibility. By the construction of PDW, for every \(l\in \varvec{S}^c\), it is easy to obtain
where \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }=\varvec{I}_{n\times n}-\varvec{w}_{\varvec{S}}(\varvec{w}'_{\varvec{S}}\varvec{w}_{\varvec{S}})^{-1}\varvec{w}'_{\varvec{S}}\) is a projection matrix, and \(\check{\varvec{z}}_{\varvec{S}}\) is the subgradient vector chosen in Step 2. Denote
and
then we can write
We study \(\mu _l\) and \(\tilde{z}_l\) respectively. Since \(\check{\varvec{z}}_{\varvec{S}}\) is the subgradient vector of the \(l_1\) norm, \(\Vert \check{\varvec{z}}_{\varvec{S}}\Vert _\infty \le 1\). This combined with the sample version of the incoherence condition \(\mathbf{A1 }\) implies \(|\mu _l|\le 1-\gamma \). Then we turn to study \(\tilde{z}_l\). Conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\), and using the property of Gaussian distribution, we obtain that \(\tilde{z}_l\) is conditional Gaussian with variance at most
where we have used the condition \(M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty \) and the fact that the projection matrix \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }\) has spectral norm one to derive the second inequality. Therefore, by the Gaussian tail bound and the union bound, we have
where t is any non-negative constant and recall that k is the maximum cardinality of \(\varvec{S}_j\). Setting \(t=\frac{\gamma }{2}\), we have
Since \(\lambda \ge \frac{2M_n}{\gamma }\sqrt{\frac{2({\log }p +{\log }q)}{n}}\), combining this with \(|\mu _l|\le 1-\gamma \), we then have
Finally, we arrive at the conclusion by noting that
Thus, we have proved Step 3.
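Two ingredients of Step 3 can be checked empirically: \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }\) is an orthogonal projection (idempotent, with spectral norm one), and the Gaussian tail bound combined with the union bound controls the maximum of many centered Gaussians. A quick sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

# (i) The residual projection matrix is idempotent with spectral norm one.
n, k = 50, 5
WS = rng.normal(size=(n, k))
P = np.eye(n) - WS @ np.linalg.solve(WS.T @ WS, WS.T)   # Pi_{w_S^perp}
assert np.allclose(P @ P, P)                             # idempotent
assert np.isclose(np.linalg.svd(P, compute_uv=False).max(), 1.0)

# (ii) Gaussian tail + union bound: for m centered Gaussians with sd sigma,
# P(max_l |z_l| >= t) <= 2 m exp(-t^2 / (2 sigma^2)).
m, sigma, t, reps = 1000, 0.1, 0.45, 2000
maxima = np.abs(rng.normal(0.0, sigma, size=(reps, m))).max(axis=1)
empirical = (maxima >= t).mean()
union = 2 * m * np.exp(-t ** 2 / (2 * sigma ** 2))
assert empirical <= union
```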
Next we prove the sign consistency of the estimator in Step 4 by establishing the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}.\) By (12) and (13), it is easy to obtain
Denote \(\varDelta _i=\varvec{e}'_i(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}})\) as the \(i\hbox {th}\) element of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\), where \(\varvec{e}_i\) is a vector with 1 at the \(i\hbox {th}\) position and 0 elsewhere. By the triangle inequality,
For the second term, by the sample version of \(\mathbf{A2 }\),
Now we bound the first term. Let
Conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\), and by condition \(\mathbf{A2 }\) and the property of Gaussian distribution, \(V_i\) is zero-mean Gaussian with variance at most
Consequently, by the Gaussian tail bound and the union bound,
Setting \(t=\frac{4\lambda }{\sqrt{C_{\mathrm{min}}}}\), by the choice of \(\lambda \) we have
with probability larger than \(1-2\,{\exp }(-C\lambda ^2 n)\), conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\). Similar to the arguments in (14), we arrive at the final bound on \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\). Further, by the condition on the minimal signal strength, i.e., \({\min }_{k,l}{|\beta ^0_{jkl}|}\ge C\frac{\lambda \sqrt{k}}{C_{\mathrm{min}}}\), the estimator preserves sign consistency.
Finally, by Proposition 1 and the condition \(\mathbf{A2 }\), \(\hat{\varvec{\beta }}=(\check{\varvec{\beta }}_{\varvec{S}},\varvec{0})\) is the unique solution to the lasso problem (7). In the spirit of the PDW method, we arrive at all the conclusions of Lemma 1. \(\square \)
Note that we have used the sample versions of \(\mathbf{A1 }\) and \(\mathbf{A2 }\) to obtain the results of Theorem 1. We now provide Lemma 2, which states that the population versions of \(\mathbf{A1 }\) and \(\mathbf{A2 }\) imply their sample counterparts with high probability.
Lemma 2
If \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^0\), and \(M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty \) a.s., then for any \(\nu >0\) and some fixed positive constants A, B and D,
The proof of Lemma 2 is similar to that in Ravikumar et al. (2010), so we omit it here.
Proof of Theorem 1
We now employ Lemmas 1 and 2 to verify Theorem 1. Denote the event that the results of Theorem 1 hold by \({\mathcal {A}}\), and let
and
Then putting together all the pieces, we have
where C is some constant that may differ from term to term. By the choice of n and \(M_n\),
The proof is completed. \(\square \)
Guo, X., Zhang, H. Sparse directed acyclic graphs incorporating the covariates. Stat Papers 61, 2119–2148 (2020). https://doi.org/10.1007/s00362-018-1027-8