Abstract
Directed acyclic graphs (DAGs) have been widely used to model causal relationships among variables in multivariate data. Often, however, covariates are available alongside these data and may influence the underlying causal network. Motivated by such data, in this paper we incorporate the covariates directly into the DAG to model the dependency relationships among the nodal variables. Specifically, the causal strengths are assumed to be linear functions of the covariates, which enhances the interpretability and flexibility of the model. We fit the model in the \(l_1\)-penalized maximum likelihood framework and employ a coordinate-descent-based algorithm to solve the resulting optimization problem. The consistency of the estimator is also established under the regime where the order of the nodal variables is known. Finally, we evaluate the performance of the proposed method through a series of simulations and a lung cancer data example.
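The fitting procedure summarized above can be sketched in a few lines of code. The sketch below is a simplified illustration, not the authors' implementation: for a single node j (with the ordering known), the design matrix stacks the Kronecker products \(x^i\otimes y^i_{\setminus j}\) so that each edge strength is linear in the covariates, and the resulting lasso problem is solved by coordinate descent. All function names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the closed-form coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(W, r, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||r - W b||^2 + lam ||b||_1."""
    n, p = W.shape
    b = np.zeros(p)
    col_sq = (W ** 2).sum(axis=0) / n
    resid = r - W @ b
    for _ in range(n_sweeps):
        for l in range(p):
            if col_sq[l] == 0:
                continue
            resid += W[:, l] * b[l]                  # remove coordinate l
            rho = W[:, l] @ resid / n
            b[l] = soft_threshold(rho, lam) / col_sq[l]
            resid -= W[:, l] * b[l]                  # restore with the update
    return b

def fit_node(Y, X, j, lam):
    """Regress node j on the other nodes, each edge strength being a
    linear function of the covariates: design rows are x^i kron y^i_{-j}."""
    n = Y.shape[0]
    Y_mj = np.delete(Y, j, axis=1)
    W = np.einsum('iq,ip->iqp', X, Y_mj).reshape(n, -1)
    return lasso_cd(W, Y[:, j], lam)

# Toy data: node 1 depends on node 0 with strength 1 + 0.5 * covariate.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
Y0 = rng.normal(size=n)
Y1 = (1.0 + 0.5 * X[:, 1]) * Y0 + 0.1 * rng.normal(size=n)
b_hat = fit_node(np.column_stack([Y0, Y1]), X, 1, lam=0.05)
```

Here `b_hat` has one entry per covariate (intercept included), estimating the baseline strength and the covariate effect of the single edge; with `lam = 0.05` both estimates are shrunk slightly toward zero.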
References
Aragam B, Zhou Q (2015) Concave penalized estimation of sparse Gaussian Bayesian networks. J Mach Learn Res 16:2273–2328
Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 8:816–824
Cai T, Liu W, Luo X (2011) A constrained \(l_1\) minimization approach to sparse precision matrix estimation. J Am Stat Assoc 106:594–607
Cai T, Liu W, Xie J (2013) Covariate adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100:139–156
Chen M, Zhao R, Zhao H, Zhou H (2016) Asymptotically normal and efficient estimation of covariate-adjusted Gaussian graphical model. J Am Stat Assoc 111:394–406
Cheng J, Levina E, Wang P, Zhu J (2014) A sparse Ising model with covariates. Biometrics 70:943–953
Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347
Drton M, Maathuis M (2017) Structure learning in graphical modeling. Annu Rev Stat Appl 4:3.1–3.29
Edwards D (2000) Introduction to graphical modelling. Springer, New York
Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1:302–332
Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432–441
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22
Fu F, Zhou Q (2013) Learning sparse causal Gaussian networks with experimental intervention: regularization and coordinate descent. J Am Stat Assoc 108:288–300
Gao B, Cui Y (2015) Learning directed acyclic graphical structures with genetical genomics data. Bioinformatics 31:3953–3960
Ha MJ, Sun W, Xie J (2016) PenPC: a two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics 72:146–155
Han SW, Chen G, Cheon MS, Zhong H (2016) Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. J Am Stat Assoc 111:1004–1019
Ising E (1925) Beitrag zur Theorie des Ferromagnetismus. Z Phys 31:253–258
Kalisch M, Bühlmann P (2007) Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Mach Learn Res 8:613–636
Lam W, Bacchus F (1994) Learning Bayesian belief networks: an approach based on the MDL principle. Comput Intel 10:269–293
Lam C, Fan J (2009) Sparsistency and rates of convergence in large covariance matrices estimation. Ann Stat 37:4254–4278
Lauritzen S (1996) Graphical models. Oxford University Press, Oxford
Leng C, Tang CY (2012) Sparse matrix graphical models. J Am Stat Assoc 107:1187–1200
Leung D, Drton M, Hara H (2016) Identifiability of directed Gaussian graphical models with one latent source. Electron J Stat 10:394–422
Liang X, Young W, Huang L, Raftery A, Yeung K (2017) Integration of multiple data sources for gene network inference using genetic perturbation data. https://doi.org/10.1101/158394
Lin J, Basu S, Banerjee M, Michailidis G (2016) Penalized maximum likelihood estimation of multi-layered Gaussian graphical models. J Mach Learn Res 17:1–51
Liu H, Chen X, Lafferty J, Wasserman L (2010) Graph-valued regression. In: Proceedings of Advances in Neural Information Processing Systems, vol 23
Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the lasso. Ann Stat 34:1436–1462
Ni Y, Stingo FC, Baladandayuthapani V (2017) Sparse multi-dimensional graphical models: a unified Bayesian framework. J Am Stat Assoc 112:779–793
Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge
Peng J, Wang P, Zhou N, Zhu J (2009) Partial correlation estimation by joint sparse regression models. J Am Stat Assoc 104:735–746
Peters J, Bühlmann P (2014) Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101:219–228
Ravikumar P, Wainwright MJ, Lafferty J (2010) High-dimensional Ising model selection using \(l_1\)-regularized logistic regression. Ann Stat 38:1287–1319
Ravikumar P, Raskutti G, Wainwright MJ (2011) High-dimensional covariance estimation by minimizing \(l_1\)-penalized log-determinant divergence. Electron J Stat 5:935–980
Rothman AJ, Bickel PJ, Levina E, Zhu J (2008) Sparse permutation invariant covariance estimation. Electron J Stat 2:494–515
Shojaie A, Michailidis G (2010) Penalized likelihood methods for estimation of sparse high dimensional directed acyclic graphs. Biometrika 97:519–538
Shojaie A, Jauhiainen A, Kallitsis M, Michailidis G (2014) Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles. PLoS ONE 9:e82392
Spirtes P, Glymour C, Scheines R (2000) Causation, prediction, and search. The MIT Press, Cambridge
van de Geer S, Bühlmann P (2013) \(l_0\)-penalized maximum likelihood for sparse directed acyclic graphs. Ann Stat 41:536–567
Wainwright MJ (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l_1\) constrained quadratic programming (lasso). IEEE Trans Inf Theory 55:2183–2202
Witten DM, Friedman JH, Simon N (2011) New insights and faster computations for the graphical lasso. J Comput Graph Stat 20:892–900
Wu T, Lange K (2008) Coordinate descent procedures for lasso penalized regression. Ann Appl Stat 2:224–244
Yin J, Li H (2011) A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann Appl Stat 5:2630–2650
Yuan M, Lin Y (2007) Model selection and estimation in the Gaussian graphical model. Biometrika 94:19–35
Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7:2541–2567
Zhou S (2014) Gemini: graph estimation with matrix variate normal instances. Ann Stat 42:532–562
Acknowledgements
We are very grateful to the Editor, the Associate Editor and the two referees for their careful work and helpful comments which led to an improved version of this paper.
This work was partially supported by National Natural Science Foundation of China under Grant Number 11571011.
Appendix
In this section, we give the proof of Theorem 1. To simplify the notation, we drop the subscript j, which indexes the \(j\hbox {th}\) problem. Define the sample version of \(\varvec{I}^0\) as
Sketch of Proof We break the proof down into a sequence of lemmas. In what follows, we first prove the results under the assumption that \(\mathbf{A1 }\) and \(\mathbf{A2 }\) are satisfied by \(\varvec{I}^n\), the sample version of \(\varvec{I}^0\); this is presented in Lemma 1. We then prove that if \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^0\), their counterparts hold for \(\varvec{I}^n\) with high probability; this is established in Lemma 2. The proof of Lemma 1 is based on a technique called the primal-dual witness method (Wainwright 2009). It consists of constructing a pair \((\check{\varvec{\beta }},\check{\varvec{z}})\) following a specific sequence of steps. Roughly, \(\check{\varvec{\beta }}\) is constructed to be zero on the non-support \(\varvec{S}^c\) of the true \(\varvec{\beta }^0\), while its elements on the true support \(\varvec{S}\) are constructed according to (7) with the solution restricted to \(\varvec{S}\). The vector \(\check{\varvec{z}}\) is built so that, with high probability, \((\check{\varvec{\beta }},\check{\varvec{z}})\) together satisfy the optimality conditions of problem (7), as shown in Proposition 1 below. In this way, if the construction succeeds, \(\check{\varvec{\beta }}\) is the optimal solution to (7). Note that \(\check{\varvec{\beta }}_{\varvec{S}^c}=\varvec{0}\), so the sign consistency can be obtained by evaluating the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\). Keep in mind that the procedure is a proof technique, not a practical algorithm.
Lemma 1
If \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^n\),
and
then with probability larger than \(1-2\,{\exp }(-C\lambda ^2n)\), the results of Theorem 1 hold.
Before proving Lemma 1, we rewrite the Lemma 1 of Wainwright (2009) using our notation as the following Proposition 1.
Proposition 1
Denote \(\varvec{w}=(\varvec{x}^i\otimes \varvec{y}^i_{\setminus j})_{i=1,\ldots ,n}\) as the design matrix, then,
(a) \(\hat{\varvec{\beta }}\) is the optimal solution to (7) if and only if there exists a subgradient vector \(\hat{\varvec{z}}\in \partial \Vert \hat{\varvec{\beta }}\Vert _1\) such that
$$\begin{aligned} \frac{1}{n}\varvec{w}'\varvec{w}(\hat{\varvec{\beta }}-\varvec{\beta }^0)-\frac{1}{n}\varvec{w}'\varvec{\epsilon }+\lambda \hat{\varvec{z}}=0. \end{aligned}$$
(12)
(b) Assume that the subgradient vector satisfies the strict feasibility condition \(|\hat{z}_l|<1\) for all \(l\in \varvec{S}^c(\hat{\varvec{\beta }})\). Then any optimal solution \(\tilde{\varvec{\beta }}\) satisfies \(\tilde{\varvec{\beta }}_l=0\) for all \(l\in \varvec{S}^c(\hat{\varvec{\beta }})\), where \(\varvec{S}^c(\hat{\varvec{\beta }})\) denotes the non-support index set of \(\hat{\varvec{\beta }}\).
(c) Under the conditions of (b), if \(\varvec{w}'_{\varvec{S}(\hat{\varvec{\beta }})}\varvec{w}_{\varvec{S}(\hat{\varvec{\beta }})}\) is invertible, then \(\hat{\varvec{\beta }}\) is the unique optimal solution of the lasso problem (7).
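Condition (12) is the standard lasso stationarity (KKT) condition, and it can be checked numerically for any computed solution. The sketch below is a hypothetical illustration: it solves a small lasso problem by proximal gradient descent (ISTA, not the paper's coordinate descent) and verifies that the gradient of the smooth part equals \(-\lambda \hat{z}\) on the active set and is bounded by \(\lambda \) in absolute value elsewhere.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(W, y, lam, n_iter=5000):
    """Proximal gradient descent for (1/2n)||y - W b||^2 + lam ||b||_1."""
    n, p = W.shape
    step = n / np.linalg.norm(W, 2) ** 2   # 1/L with L = ||W||_2^2 / n
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = W.T @ (W @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(0)
n, p, lam = 200, 8, 0.1
W = rng.normal(size=(n, p))
beta0 = np.zeros(p)
beta0[:2] = [1.0, -2.0]
y = W @ beta0 + 0.3 * rng.normal(size=n)
b = ista(W, y, lam)

# Since y = W beta0 + eps, condition (12),
# (1/n)W'W(b - beta0) - (1/n)W'eps + lam*z = 0, reads (1/n)W'(W b - y) = -lam*z.
g = W.T @ (W @ b - y) / n
active = b != 0
assert np.all(np.abs(g[active] + lam * np.sign(b[active])) < 1e-3)  # z = sign(b) on support
assert np.all(np.abs(g) <= lam + 1e-6)                              # |z| <= 1 off support
```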
Proposition 1 combined with the condition \(\mathbf{A2 }\) will be used to show the uniqueness of the estimator. Now we start to prove Lemma 1.
Proof of Lemma 1
We use the primal-dual witness (PDW) method (Wainwright 2009) to prove the results, which consists of constructing a pair \((\check{\varvec{\beta }},\check{\varvec{z}})\) by the following four steps.
Step 1: Set \(\check{\varvec{\beta }}_{\varvec{S}^c}=\varvec{0}\) and construct \(\check{\varvec{\beta }}_{\varvec{S}}\) as the solution of the following restricted lasso problem,
By the sample version of condition \(\mathbf{A2 }\), we know that the solution to (13) is unique.
Step 2: Choose \(\check{\varvec{z}}_{\varvec{S}}\) as one of the subdifferential of the \(l_1\) norm at \(\check{\varvec{\beta }}_{\varvec{S}}\), i.e., \(\check{\varvec{z}}_{\varvec{S}}\in \partial \Vert \check{\varvec{\beta }}_{\varvec{S}}\Vert _1 \).
Step 3: Set \(\check{\varvec{z}}_{\varvec{S}^c}\) as the solution to problem (12) and prove the strict dual feasibility condition \(|\check{\varvec{z}}_{l}|<1\) for all \(l\in \varvec{S}^c\), which ensures the optimality.
Step 4: Prove the sign consistency by establishing the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}.\)
By Proposition 1, if the above four steps succeed, then \(\hat{\varvec{\beta }}=(\check{\varvec{\beta }}_{\varvec{S}},\varvec{0})\) is the unique solution to the lasso problem (7). Moreover, \(\hat{\varvec{\beta }}\) correctly identifies the support of the true \(\varvec{\beta }^0\) with the correct signs. For the PDW construction to succeed, we in fact only need to show that Steps 3 and 4 hold with high probability. Next, we prove them in turn.
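The four steps can be mirrored numerically on a toy problem. The sketch below is a hypothetical illustration, not part of the proof: it assumes the restricted solution keeps the signs of \(\varvec{\beta }^0_{\varvec{S}}\) (which the beta-min condition guarantees), so Step 1 admits a closed form, and then checks strict dual feasibility and the \(l_\infty \) error directly.

```python
import numpy as np

def pdw_construction(W, y, beta0_S, S, lam):
    """Mirror the four PDW steps (a proof device, not an estimator),
    assuming the restricted solution has the signs of beta0_S."""
    n, p = W.shape
    Sc = [l for l in range(p) if l not in S]
    WS, WSc = W[:, S], W[:, Sc]
    z_S = np.sign(beta0_S)                    # Step 2: subgradient on S
    G = WS.T @ WS / n
    # Step 1: restricted lasso in closed form given the signs
    b_S = np.linalg.solve(G, WS.T @ y / n - lam * z_S)
    # Step 3: solve the stationarity condition (12) for z on S^c
    resid = y - WS @ b_S
    z_Sc = WSc.T @ resid / (n * lam)
    strict = np.max(np.abs(z_Sc)) < 1.0       # strict dual feasibility
    linf = np.max(np.abs(b_S - beta0_S))      # Step 4: l_inf error on S
    return b_S, z_Sc, strict, linf

rng = np.random.default_rng(0)
n, p, S = 500, 10, [0, 1]
W = rng.normal(size=(n, p))                   # incoherent Gaussian design
beta0_S = np.array([1.5, -1.0])
y = W[:, S] @ beta0_S + 0.5 * rng.normal(size=n)
b_S, z_Sc, strict, linf = pdw_construction(W, y, beta0_S, S, lam=0.1)
```

With an incoherent design and strong signals, `strict` comes out `True` and `linf` is small, so by Proposition 1 the witness pair certifies the full lasso solution.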
We begin with verifying the strict dual feasibility. By the construction of PDW, for every \(l\in \varvec{S}^c\), it is easy to obtain
where \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }=\varvec{I}_{n\times n}-\varvec{w}_{\varvec{S}}(\varvec{w}'_{\varvec{S}}\varvec{w}_{\varvec{S}})^{-1}\varvec{w}'_{\varvec{S}}\) is a projection matrix, and \(\check{\varvec{z}}_{\varvec{S}}\) is the subgradient vector chosen in Step 2. Denote
and
then we can write
We study \(\mu _l\) and \(\tilde{z}_l\) respectively. Since \(\check{\varvec{z}}_{\varvec{S}}\) is the subgradient vector of the \(l_1\) norm, \(\Vert \check{\varvec{z}}_{\varvec{S}}\Vert _\infty \le 1\). This combined with the sample version of the incoherence condition \(\mathbf{A1 }\) implies \(|\mu _l|\le 1-\gamma \). Then we turn to study \(\tilde{z}_l\). Conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\), and using the property of Gaussian distribution, we obtain that \(\tilde{z}_l\) is conditional Gaussian with variance at most
where we have used the condition \(M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty \) and the fact that the projection matrix \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }\) has spectral norm one to derive the second inequality. Therefore, by the Gaussian tail bound and the union bound, we have
where t is any non-negative constant and recall that k is the maximum cardinality of \(\varvec{S}_j\). Setting \(t=\frac{\gamma }{2}\), we have
Since \(\lambda \ge \frac{2M_n}{\gamma }\sqrt{\frac{2({\log }p +{\log }q)}{n}}\), combining this with \(|\mu _l|\le 1-\gamma \), we then have
Finally, we arrive at the conclusion by noting that
Thus, we have proved Step 3.
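Two ingredients of Step 3 can be checked empirically: \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }\) is an orthogonal projection (idempotent, with spectral norm one), and the Gaussian tail bound combined with the union bound controls the maximum of many centered Gaussians. A quick sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

# (i) The residual projection matrix is idempotent with spectral norm one.
n, k = 50, 5
WS = rng.normal(size=(n, k))
P = np.eye(n) - WS @ np.linalg.solve(WS.T @ WS, WS.T)   # Pi_{w_S^perp}
assert np.allclose(P @ P, P)                             # idempotent
assert np.isclose(np.linalg.svd(P, compute_uv=False).max(), 1.0)

# (ii) Gaussian tail + union bound: for m centered Gaussians with sd sigma,
# P(max_l |z_l| >= t) <= 2 m exp(-t^2 / (2 sigma^2)).
m, sigma, t, reps = 1000, 0.1, 0.45, 2000
maxima = np.abs(rng.normal(0.0, sigma, size=(reps, m))).max(axis=1)
empirical = (maxima >= t).mean()
union = 2 * m * np.exp(-t ** 2 / (2 * sigma ** 2))
assert empirical <= union
```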
Next we prove the sign consistency of the estimator in Step 4 by establishing the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}.\) By (12) and (13), it is easy to obtain
Denote \(\varDelta _i=\varvec{e}'_i(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}})\) as the \(i\hbox {th}\) element of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\), where \(\varvec{e}_i\) is a vector with 1 at the \(i\hbox {th}\) position and 0 elsewhere. By the triangle inequality,
For the second term, by the sample version of \(\mathbf{A2 }\),
Now we bound the first term. Let
Conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\), and by condition \(\mathbf{A2 }\) and the property of Gaussian distribution, \(V_i\) is zero-mean Gaussian with variance at most
Consequently, by the Gaussian tail bound and the union bound,
Setting \(t=\frac{4\lambda }{\sqrt{C_{\mathrm{min}}}}\), by the choice of \(\lambda \) we have
with probability larger than \(1-2\,{\exp }(-C\lambda ^2 n)\), conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\). Similar to the arguments in (14), we arrive at the final bound on \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\). Further, by the condition on the minimal signal strength, i.e., \({\min }_{k,l}{|\beta ^0_{jkl}|}\ge C\frac{\lambda \sqrt{k}}{C_{\mathrm{min}}}\), the estimator preserves sign consistency.
Finally, by Proposition 1 and the condition \(\mathbf{A2 }\), \(\hat{\varvec{\beta }}=(\check{\varvec{\beta }}_{\varvec{S}},\varvec{0})\) is the unique solution to the lasso problem (7). In the spirit of the PDW method, we arrive at all the conclusions of Lemma 1. \(\square \)
Note that we have used the sample versions of \(\mathbf{A1 }\) and \(\mathbf{A2 }\) to obtain the results of Theorem 1. We now provide Lemma 2, which states that the population versions of \(\mathbf{A1 }\) and \(\mathbf{A2 }\) imply their sample counterparts with high probability.
Lemma 2
If \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^0\), and \(M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty \) a.s., then for any \(\nu >0\) and some fixed positive constants A, B and D,
The proof of Lemma 2 is similar to that in Ravikumar et al. (2010), so we omit it here.
Proof of Theorem 1
We now employ Lemmas 1 and 2 to verify Theorem 1. Denote the event that the results of Theorem 1 hold by \({\mathcal {A}}\), and let
and
Then putting together all the pieces, we have
where C is some constant that may differ from term to term. By the choice of n and \(M_n\),
The proof is completed. \(\square \)
Guo, X., Zhang, H. Sparse directed acyclic graphs incorporating the covariates. Stat Papers 61, 2119–2148 (2020). https://doi.org/10.1007/s00362-018-1027-8