
Sparse directed acyclic graphs incorporating the covariates


Abstract

Directed acyclic graphs (DAGs) have been widely used to model the causal relationships among variables using multivariate data. However, covariates are often available together with these data and may influence the underlying causal network. Motivated by such data, in this paper we incorporate the covariates directly into the DAG to model the dependency relationships among the nodal variables. Specifically, the causal strengths are assumed to be linear functions of the covariates, which enhances the interpretability and flexibility of the model. We fit the model in the \(l_1\) penalized maximum likelihood framework and employ a coordinate descent based algorithm to solve the resulting optimization problem. The consistency of the estimator is also established under the regime where the order of the nodal variables is known. Finally, we evaluate the performance of the proposed method through a series of simulations and a lung cancer data example.
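The fitting strategy described above can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): with a known ordering of the nodes, each node is regressed on its predecessors with edge weights that are linear in the covariates, which amounts to an \(l_1\)-penalized regression on the Kronecker features \(\varvec{x}\otimes \varvec{y}_{\setminus j}\) solved by coordinate descent. All names, dimensions, and simulated data below are illustrative assumptions.

```python
import numpy as np

# A minimal, hypothetical sketch (not the authors' code): node-wise l1-penalized
# regression for a covariate-adjusted DAG with a known ordering. Edge weights are
# linear in the covariates x, so the regression of y_j on its predecessors uses
# the per-sample Kronecker features x^i (x) y^i.

rng = np.random.default_rng(0)
n, p, q = 200, 5, 3            # samples, nodes, covariates (first column is an intercept)

X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])   # covariate matrix
Y = np.zeros((n, p))
B = np.zeros((p, p, q))        # B[j, k] maps the covariates to the weight of edge k -> j
B[2, 0] = [1.0, 0.5, 0.0]      # a few true covariate-dependent edges (illustrative)
B[3, 1] = [-0.8, 0.0, 0.7]
for j in range(p):             # generate data along the known ordering
    mean = sum((X @ B[j, k]) * Y[:, k] for k in range(j)) if j else 0.0
    Y[:, j] = mean + rng.normal(scale=0.5, size=n)

def lasso_cd(W, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - W b||^2 + lam * ||b||_1."""
    n_obs, d = W.shape
    b = np.zeros(d)
    col_sq = (W ** 2).sum(axis=0) / n_obs
    r = y - W @ b
    for _ in range(n_iter):
        for l in range(d):
            if col_sq[l] == 0.0:
                continue
            r += W[:, l] * b[l]                        # partial residual without coordinate l
            rho = W[:, l] @ r / n_obs
            b[l] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[l]   # soft-thresholding update
            r -= W[:, l] * b[l]
    return b

# Fit node j = 3: regress y_3 on the Kronecker features of its predecessors.
j = 3
pred = list(range(j))
W = np.hstack([X * Y[:, [k]] for k in pred])           # row i stacks x^i * y^i_k over k in pred
beta_hat = lasso_cd(W, Y[:, j], lam=0.1).reshape(len(pred), q)
print(np.round(beta_hat, 2))                           # rows estimate B[3, k] for k in pred
```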


Notes

  1. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68571.

References

  • Aragam B, Zhou Q (2015) Concave penalized estimation of sparse Gaussian Bayesian networks. J Mach Learn Res 16:2273–2328

  • Barabási AL, Albert R (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97

  • Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med 8:816–824

  • Cai T, Liu W, Luo X (2011) A constrained \(l_1\) minimization approach to sparse precision matrix estimation. J Am Stat Assoc 106:594–607

  • Cai T, Liu W, Xie J (2013) Covariate adjusted precision matrix estimation with an application in genetical genomics. Biometrika 100:139–156

  • Chen M, Zhao R, Zhao H, Zhou H (2016) Asymptotically normal and efficient estimation of covariate-adjusted Gaussian graphical model. J Am Stat Assoc 111:394–406

  • Cheng J, Levina E, Wang P, Zhu J (2014) A sparse Ising model with covariates. Biometrics 70:943–953

  • Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347

  • Drton M, Maathuis M (2017) Structure learning in graphical modeling. Annu Rev Stat Appl 4:3.1–3.29

  • Edwards D (2000) Introduction to graphical modelling. Springer, New York

  • Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1:302–332

  • Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432–441

  • Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22

  • Fu F, Zhou Q (2013) Learning sparse causal Gaussian networks with experimental intervention: regularization and coordinate descent. J Am Stat Assoc 108:288–300

  • Gao B, Cui Y (2015) Learning directed acyclic graphical structures with genetical genomics data. Bioinformatics 31:3953–3960

  • Ha MJ, Sun W, Xie J (2016) PenPC: a two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics 72:146–155

  • Han SW, Chen G, Cheon MS, Zhong H (2016) Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. J Am Stat Assoc 111:1004–1019

  • Ising E (1925) Beitrag zur Theorie des Ferromagnetismus. Z Phys 31:253–258

  • Kalisch M, Bühlmann P (2007) Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Mach Learn Res 8:613–636

  • Lam W, Bacchus F (1994) Learning Bayesian belief networks: an approach based on the MDL principle. Comput Intell 10:269–293

  • Lam C, Fan J (2009) Sparsistency and rates of convergence in large covariance matrices estimation. Ann Stat 37:4254–4278

  • Lauritzen S (1996) Graphical models. Oxford University Press, Oxford

  • Leng C, Tang CY (2012) Sparse matrix graphical models. J Am Stat Assoc 107:1187–1200

  • Leung D, Drton M, Hara H (2016) Identifiability of directed Gaussian graphical models with one latent source. Electron J Stat 10:394–422

  • Liang X, Young W, Huang L, Raftery A, Yeung K (2017) Integration of multiple data sources for gene network inference using genetic perturbation data. https://doi.org/10.1101/158394

  • Lin J, Basu S, Banerjee M, Michailidis G (2016) Penalized maximum likelihood estimation of multi-layered Gaussian graphical models. J Mach Learn Res 17:1–51

  • Liu H, Chen X, Lafferty J, Wasserman L (2010) Graph-valued regression. In: Proceedings of Advances in Neural Information Processing Systems, vol 23

  • Meinshausen N, Bühlmann P (2006) High-dimensional graphs and variable selection with the lasso. Ann Stat 34:1436–1462

  • Ni Y, Stingo FC, Baladandayuthapani V (2017) Sparse multi-dimensional graphical models: a unified Bayesian framework. J Am Stat Assoc 112:779–793

  • Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge

  • Peng J, Wang P, Zhou N, Zhu J (2009) Partial correlation estimation by joint sparse regression model. J Am Stat Assoc 104:735–746

  • Peters J, Bühlmann P (2014) Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101:219–228

  • Ravikumar P, Wainwright MJ, Lafferty J (2010) High-dimensional Ising model selection using \(l_1\)-regularized logistic regression. Ann Stat 38:1287–1319

  • Ravikumar P, Raskutti G, Wainwright MJ (2011) High-dimensional covariance estimation by minimizing \(l_1\)-penalized log-determinant divergence. Electron J Stat 5:935–980

  • Rothman AJ, Bickel PJ, Levina E, Zhu J (2008) Sparse permutation invariant covariance estimation. Electron J Stat 2:494–515

  • Shojaie A, Michailidis G (2010) Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika 97:519–538

  • Shojaie A, Jauhiainen A, Kallitsis M, Michailidis G (2014) Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles. PLoS ONE 9:e82392

  • Spirtes P, Glymour C, Scheines R (2000) Causation, prediction, and search. The MIT Press, Cambridge

  • van de Geer S, Bühlmann P (2013) \(l_0\)-penalized maximum likelihood for sparse directed acyclic graphs. Ann Stat 41:536–567

  • Wainwright MJ (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(l_1\)-constrained quadratic programming (lasso). IEEE Trans Inf Theory 55:2183–2202

  • Witten DM, Friedman JH, Simon N (2011) New insights and faster computations for the graphical lasso. J Comput Graph Stat 20:892–900

  • Wu T, Lange K (2008) Coordinate descent procedures for lasso penalized regression. Ann Appl Stat 2:224–244

  • Yin J, Li H (2011) A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann Appl Stat 5:2630–2650

  • Yuan M, Lin Y (2007) Model selection and estimation in the Gaussian graphical model. Biometrika 94:19–35

  • Zhao P, Yu B (2006) On model selection consistency of lasso. J Mach Learn Res 7:2541–2567

  • Zhou S (2014) Gemini: graph estimation with matrix variate normal instances. Ann Stat 42:532–562


Acknowledgements

We are very grateful to the Editor, the Associate Editor and the two referees for their careful work and helpful comments which led to an improved version of this paper.

Author information

Corresponding author

Correspondence to Hai Zhang.

Additional information

This work was partially supported by the National Natural Science Foundation of China under Grant Number 11571011.

Appendix

In this section, we give the proof of Theorem 1. To simplify the notation, we drop the subscript j, which indexes the \(j\hbox {th}\) problem. Define the sample version of \(\varvec{I}^0\) as

$$\begin{aligned} \varvec{I}^n=\frac{1}{n}\sum _{i=1}^n\left( \varvec{x}^i\otimes \varvec{y}^i_{\setminus j}\right) '\left( \varvec{x}^i\otimes \varvec{y}^i_{\setminus j}\right) . \end{aligned}$$
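For concreteness, \(\varvec{I}^n\) can be formed directly from the Kronecker features; the helper below is a hypothetical sketch (names are illustrative) using the same numpy conventions as the earlier example.

```python
import numpy as np

# Hypothetical helper: I^n for node j is the averaged Gram matrix of the
# Kronecker features w^i = x^i (x) y^i_{\j}.
def sample_information(X, Y, j):
    Yj = np.delete(Y, j, axis=1)                               # y_{\j}: all nodes except j
    W = np.vstack([np.kron(X[i], Yj[i]) for i in range(X.shape[0])])
    return W.T @ W / X.shape[0]                                # I^n = (1/n) sum_i (w^i)'(w^i)

# I_n = sample_information(X, Y, j=3)    # X, Y as in the earlier sketch
```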

Sketch of Proof We break the proof into a sequence of lemmas. In what follows, we first prove the results when \(\mathbf{A1 }\) and \(\mathbf{A2 }\) are satisfied by \(\varvec{I}^n\), the sample version of \(\varvec{I}^0\); this is presented in Lemma 1. We then prove that if \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^0\), their counterparts hold for \(\varvec{I}^n\) with high probability; this is established in Lemma 2. The proof of Lemma 1 is based on the primal-dual witness method (Wainwright 2009). It consists of constructing a pair \((\check{\varvec{\beta }},\check{\varvec{z}})\) through a specific sequence of steps. Roughly, \(\check{\varvec{\beta }}\) is constructed to be zero on the non-support \(\varvec{S}^c\) of the true \(\varvec{\beta }^0\), and its elements on the true support \(\varvec{S}\) are constructed according to (7) with the solution restricted to \(\varvec{S}\). The vector \(\check{\varvec{z}}\) is built such that, with high probability, \((\check{\varvec{\beta }},\check{\varvec{z}})\) together satisfy the optimality conditions of problem (7), as shown in Proposition 1 below. In this way, if the construction succeeds, \(\check{\varvec{\beta }}\) is the optimal solution to (7). Note that \(\check{\varvec{\beta }}_{\varvec{S}^c}=\varvec{0}\), and thus sign consistency can be obtained by evaluating the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\). It should be kept in mind that this procedure is not a practical algorithm but a proof technique.

Lemma 1

If \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^n\),

$$\begin{aligned} M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty , a.s. \end{aligned}$$

and

$$\begin{aligned} \lambda \ge \frac{2M_n}{\gamma }\sqrt{\frac{2({\log }p+{\log }q)}{n}}, \end{aligned}$$

then with probability larger than \(1-2\,{\exp }(-C\lambda ^2n)\), the results of Theorem 1 hold.
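The regularization level required by Lemma 1 is an explicit function of \(M_n\), \(\gamma \), p, q and n; the one-line helper below is a hypothetical illustration of that threshold.

```python
import numpy as np

# Hypothetical: the lower bound on lambda assumed in Lemma 1.
def lambda_threshold(M_n, gamma, p, q, n):
    return 2.0 * M_n / gamma * np.sqrt(2.0 * (np.log(p) + np.log(q)) / n)
```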

Before proving Lemma 1, we rewrite the Lemma 1 of Wainwright (2009) using our notation as the following Proposition 1.

Proposition 1

Denote by \(\varvec{w}=(\varvec{x}^i\otimes \varvec{y}^i_{\setminus j})_{i=1,\ldots ,n}\) the design matrix whose \(i\hbox {th}\) row is \(\varvec{x}^i\otimes \varvec{y}^i_{\setminus j}\). Then:

  (a) \(\hat{\varvec{\beta }}\) is the optimal solution to (7) if and only if there exists a subgradient vector \(\hat{\varvec{z}}\in \partial \Vert \hat{\varvec{\beta }}\Vert _1\) such that

    $$\begin{aligned} \frac{1}{n}\varvec{w}'\varvec{w}(\hat{\varvec{\beta }}-\varvec{\beta }^0)-\frac{1}{n}\varvec{w}'\varvec{\epsilon }+\lambda \hat{\varvec{z}}=0. \end{aligned}$$
    (12)

  (b) Assume that the subgradient vector satisfies the strict dual feasibility condition \(|\hat{z}_l|<1\) for all \(l\in \varvec{S}^c(\hat{\varvec{\beta }})\). Then any optimal solution \(\tilde{\varvec{\beta }}\) satisfies \(\tilde{\varvec{\beta }}_l=0\) for all \(l\in \varvec{S}^c(\hat{\varvec{\beta }})\), where \(\varvec{S}^c(\hat{\varvec{\beta }})\) denotes the non-support index set of \(\hat{\varvec{\beta }}\).

  (c) Under the conditions of (b), if \(\varvec{w}'_{\varvec{S}(\hat{\varvec{\beta }})}\varvec{w}_{\varvec{S}(\hat{\varvec{\beta }})}\) is invertible, then \(\hat{\varvec{\beta }}\) is the unique optimal solution of the lasso problem (7).

Proposition 1 combined with the condition \(\mathbf{A2 }\) will be used to show the uniqueness of the estimator. Now we start to prove Lemma 1.

Proof of Lemma 1

We use the primal-dual witness (PDW) method (Wainwright 2009) to prove the results, which consists of constructing a pair \((\check{\varvec{\beta }},\check{\varvec{z}})\) by the following four steps.

Step 1: Set \(\check{\varvec{\beta }}_{\varvec{S}^c}=\varvec{0}\) and construct \(\check{\varvec{\beta }}_{\varvec{S}}\) as the solution of the following restricted lasso problem,

$$\begin{aligned} \check{\varvec{\beta }}_{\varvec{S}}= \;\hbox {arg min}\;\left\{ \frac{1}{2n}\sum _{i=1}^n(y_j^i-(\varvec{x}^i\otimes \varvec{y}^i_{\setminus j})_{\varvec{S}}\varvec{\beta }_{\varvec{S}})^2+\lambda \Vert \varvec{\beta }_{\varvec{S}}\Vert _1\right\} . \end{aligned}$$
(13)

By the sample version of condition \(\mathbf{A2 }\), we know that the solution to (13) is unique.

Step 2: Choose \(\check{\varvec{z}}_{\varvec{S}}\) as an element of the subdifferential of the \(l_1\) norm at \(\check{\varvec{\beta }}_{\varvec{S}}\), i.e., \(\check{\varvec{z}}_{\varvec{S}}\in \partial \Vert \check{\varvec{\beta }}_{\varvec{S}}\Vert _1 \).

Step 3: Set \(\check{\varvec{z}}_{\varvec{S}^c}\) as the solution to problem (12) and prove the strict dual feasibility condition \(|\check{\varvec{z}}_{l}|<1\) for all \(l\in \varvec{S}^c\), which ensures the optimality.

Step 4: Prove the sign consistency by establishing the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}.\)

By Proposition 1, if the above four steps succeed, then \(\hat{\varvec{\beta }}=(\check{\varvec{\beta }}_{\varvec{S}},\varvec{0})\) is the unique solution to the lasso problem (7), and \(\hat{\varvec{\beta }}\) correctly identifies the support of the true \(\varvec{\beta }^0\) with the correct signs. Indeed, for the PDW construction to succeed, we only need to show that Step 3 and Step 4 hold with high probability. We now prove them in turn.
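Although the PDW construction is a proof device rather than an estimation algorithm, its steps can be mimicked numerically. The sketch below is hypothetical: it reuses the lasso_cd solver from the first example on a generic toy design (not the Kronecker design of the paper) and checks strict dual feasibility and the sup-norm error on a simulated instance.

```python
import numpy as np

# Hypothetical illustration of the PDW steps on a toy lasso instance.
rng = np.random.default_rng(1)
n, d, lam = 500, 12, 0.2
beta0 = np.zeros(d); beta0[:3] = [1.5, -1.0, 0.8]          # true support S = {0, 1, 2}
S = np.flatnonzero(beta0)
Sc = np.setdiff1d(np.arange(d), S)
W = rng.normal(size=(n, d))
y = W @ beta0 + rng.normal(scale=0.5, size=n)

# Step 1: restricted lasso on S (lasso_cd as defined in the first sketch).
beta_S = lasso_cd(W[:, S], y, lam)

# Step 2: a subgradient of the l1 norm at beta_S.
z_S = np.sign(beta_S)

# Step 3: solve the stationarity condition (12) for z on S^c and check |z_l| < 1.
resid = y - W[:, S] @ beta_S
z_Sc = W[:, Sc].T @ resid / (lam * n)
print("strict dual feasibility:", bool(np.all(np.abs(z_Sc) < 1.0)))

# Step 4: sup-norm error of the restricted solution on the support.
print("sup-norm error on S:", float(np.max(np.abs(beta_S - beta0[S]))))
```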

We begin by verifying the strict dual feasibility. By the PDW construction, for every \(l\in \varvec{S}^c\), it is easy to obtain

$$\begin{aligned} \check{\varvec{z}}_{l}=\varvec{w}'_l\left\{ \varvec{w}_{\varvec{S}}(\varvec{w}'_{\varvec{S}}\varvec{w}_{\varvec{S}})^{-1}\check{\varvec{z}}_{\varvec{S}}+\varPi _{\varvec{w}_{\varvec{S}}^\bot }\left( \frac{\varvec{\epsilon }}{\lambda n}\right) \right\} , \end{aligned}$$

where \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }=\varvec{I}_{n\times n}-\varvec{w}_{\varvec{S}}(\varvec{w}'_{\varvec{S}}\varvec{w}_{\varvec{S}})^{-1}\varvec{w}'_{\varvec{S}}\) is a projection matrix, and \(\check{\varvec{z}}_{\varvec{S}}\) is the subgradient vector chosen in Step 2. Denote

$$\begin{aligned} \mu _l=\varvec{w}'_l\varvec{w}_{\varvec{S}}(\varvec{w}'_{\varvec{S}}\varvec{w}_{\varvec{S}})^{-1}\check{\varvec{z}}_{\varvec{S}}, \end{aligned}$$

and

$$\begin{aligned} \tilde{z}_l=\varvec{w}'_l\varPi _{\varvec{w}_{\varvec{S}}^\bot }\left( \frac{\varvec{\epsilon }}{\lambda n}\right) , \end{aligned}$$

then we can write

$$\begin{aligned} \check{ z}_{l}=\mu _l+\tilde{z}_l,\;\hbox {for}\;l\in \varvec{S}^c. \end{aligned}$$

We study \(\mu _l\) and \(\tilde{z}_l\) separately. Since \(\check{\varvec{z}}_{\varvec{S}}\) is a subgradient vector of the \(l_1\) norm, \(\Vert \check{\varvec{z}}_{\varvec{S}}\Vert _\infty \le 1\). This, combined with the sample version of the incoherence condition \(\mathbf{A1 }\), implies \(|\mu _l|\le 1-\gamma \). We then turn to \(\tilde{z}_l\). Conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\), and using the properties of the Gaussian distribution, we obtain that \(\tilde{z}_l\) is conditionally Gaussian with variance at most

$$\begin{aligned} \frac{1}{\lambda ^2 n^2}\Vert \varvec{w}'_l\varPi _{\varvec{w}_{\varvec{S}}^\bot }\Vert _2^2\le \frac{1}{\lambda ^2 n^2}\Vert \varPi _{\varvec{w}_{\varvec{S}}^\bot }(\varvec{w}_l)\Vert _2^2\le \frac{M_n^2}{\lambda ^2n}, \end{aligned}$$

where we have used the condition \(M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty \) and the fact that the projection matrix \(\varPi _{\varvec{w}_{\varvec{S}}^\bot }\) has spectral norm one to derive the second inequality. Therefore, by the Gaussian tail bound and the union bound, we have

$$\begin{aligned} {\mathbb {P}}\left( {\max }_{l\in \varvec{S}^c}|\tilde{z}_l|\ge t\mid \{\varvec{x}^i\}_{i=1}^n, \{\varvec{y}^i_{\setminus j}\}_{i=1}^n\right) \le 2(pq-k) {\exp }\left( -\frac{\lambda ^2 nt^2}{2M_n^2}\right) , \end{aligned}$$

where t is any non-negative constant, and recall that k is the maximum cardinality of \(\varvec{S}_j\). Letting \(t=\frac{\gamma }{2}\), we have

$$\begin{aligned} {\mathbb {P}}\left( {\max }_{l\in \varvec{S}^c}|\tilde{z}_l|\ge \frac{\gamma }{2}\mid \{\varvec{x}^i\}_{i=1}^n, \{\varvec{y}^i_{\setminus j}\}_{i=1}^n\right) \le 2\,{\exp }\left( -\frac{\lambda ^2 n\gamma ^2}{8M_n^2}+{\log }(pq-k)\right) . \end{aligned}$$

Since \(\lambda \ge \frac{2M_n}{\gamma }\sqrt{\frac{2({\log }p +{\log }q)}{n}}\), combining this with \(|\mu _l|\le 1-\gamma \), we have

$$\begin{aligned} {\mathbb {P}}\left( {\max }_{l\in \varvec{S}^c}|z_l|\ge 1-\frac{\gamma }{2}\mid \{\varvec{x}^i\}_{i=1}^n, \{\varvec{y}^i_{\setminus j}\}_{i=1}^n\right) \le 2\,{\exp }(-C\lambda ^2 n). \end{aligned}$$

Finally, we arrive at the conclusion by noting that

$$\begin{aligned}&{\mathbb {P}}\left( {\max }_{l\in \varvec{S}^c}|z_l|\ge 1-\frac{\gamma }{2}\right) =E\left( \mathbb I_{{\max }_{l\in \varvec{S}^c}|z_l|\ge 1-\frac{\gamma }{2}}\right) \nonumber \\&\quad =E_{\varvec{X},\varvec{Y}} \left( E\left( \mathbb I_{{\max }_{l\in \varvec{S}^c}|z_l|\ge 1-\frac{\gamma }{2}}\mid \{\varvec{x}^i\}_{i=1}^n, \{\varvec{y}^i_{\setminus j}\}_{i=1}^n\right) \right) \nonumber \\&\quad =E_{\varvec{X},\varvec{Y}} \left( {\mathbb {P}}\left( {\max }_{l\in \varvec{S}^c}|z_l|\ge 1-\frac{\gamma }{2}\mid \{\varvec{x}^i\}_{i=1}^n, \{\varvec{y}^i_{\setminus j}\}_{i=1}^n\right) \right) \nonumber \\&\quad \le E_{\varvec{X},\varvec{Y}}\left( 2\,{\exp }(-C\lambda ^2 n)\right) =2\,{\exp }(-C\lambda ^2 n). \end{aligned}$$
(14)

Thus, we have proved Step 3.

Next we prove the sign consistency of the estimator in Step 4 by establishing the \(l_\infty \) bound of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}.\) By (12) and (13), it is easy to obtain

$$\begin{aligned} \check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}=\left( \frac{1}{n}\varvec{w}'_{\varvec{S}}\varvec{w}_{\varvec{S}}\right) ^{-1} \left( \frac{1}{n}\varvec{w}'_{\varvec{S}}\varvec{\epsilon }-\lambda \hbox {sign}(\check{\varvec{z}}_{\varvec{S}})\right) . \end{aligned}$$
(15)

Denote \(\varDelta _i=\varvec{e}'_i(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}})\) as the \(i\hbox {th}\) element of \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\), where \(\varvec{e}_i\) is a vector with 1 at the \(i\hbox {th}\) position and 0 elsewhere. By the triangle inequality,

$$\begin{aligned} {\max }_{i\in \varvec{S}}|\varDelta _i|\le \Vert (\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\varvec{w}'_{\varvec{S}}\frac{\varvec{\epsilon }}{n}\Vert _\infty +\Vert (\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\Vert _{l_\infty }\lambda . \end{aligned}$$

For the second term, by the sample version of \(\mathbf{A2 }\),

$$\begin{aligned} \Vert (\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\Vert _{l_\infty }\lambda =\Vert (\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\Vert _{l_1}\lambda \le \lambda \sqrt{k}\Vert (\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\Vert _{l_2}\le \frac{\lambda \sqrt{k}}{C_{\mathrm{min}}}. \end{aligned}$$

Now we bound the first term. Let

$$\begin{aligned} V_i=e'_i(\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\frac{1}{n}\varvec{w}'_{\varvec{S}}\varvec{\epsilon }. \end{aligned}$$

Conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\), and by condition \(\mathbf{A2 }\) and the property of Gaussian distribution, \(V_i\) is zero-mean Gaussian with variance at most

$$\begin{aligned} \frac{1}{n}\Vert (\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\Vert _{l_2}\le \frac{ 1}{nC_{\mathrm{min}}}. \end{aligned}$$

Consequently, by the Gaussian tail bound and the union bound,

$$\begin{aligned} {\mathbb {P}}({\max }_{i\in \varvec{S}}|V_i|>t\mid \{\varvec{x}^i\}_{i=1}^n, \{\varvec{y}^i_{\setminus j}\}_{i=1}^n )\le 2\,{\exp }\left( -\frac{t^2C_{\mathrm{min}}n}{2}+{\log }k\right) . \end{aligned}$$

Set \(t=\frac{4\lambda }{\sqrt{C_{\mathrm{min}}}}\) and by the choice of \(\lambda \), we have

$$\begin{aligned} \Vert \check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\Vert _{l_\infty }\le \lambda \left( \frac{4}{\sqrt{C_{\mathrm{min}}}}+\frac{\sqrt{k}}{C_{\mathrm{min}}}\right) , \end{aligned}$$

with probability larger than \(1-2\,{\exp }(-C\lambda ^2 n)\), conditioning on \(\{\varvec{x}^i\}_{i=1}^n\) and \(\{\varvec{y}^i_{\setminus j}\}_{i=1}^n\). Arguing as in (14), we arrive at the final, unconditional bound on \(\check{\varvec{\beta }}_{\varvec{S}}-\varvec{\beta }^0_{\varvec{S}}\). Further, by the assumption on the minimal signal strength, i.e., \({\min }_{k,l}{|\beta ^0_{jkl}|}\ge C\frac{\lambda \sqrt{k}}{C_{\mathrm{min}}}\), the estimator achieves sign consistency.
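The resulting bound is explicit in \(\lambda \), \(C_{\mathrm{min}}\) and k; the helper below is a hypothetical illustration of that quantity.

```python
import numpy as np

# Hypothetical: the sup-norm error bound obtained in Step 4.
def sup_norm_bound(lam, C_min, k):
    return lam * (4.0 / np.sqrt(C_min) + np.sqrt(k) / C_min)
```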

Finally, by Proposition 1 and the condition \(\mathbf{A2 }\), \(\hat{\varvec{\beta }}=(\check{\varvec{\beta }}_{\varvec{S}},\varvec{0})\) is the unique solution to the lasso problem (7). In view of the PDW construction, we arrive at all the conclusions of Lemma 1. \(\square \)

Note that we have used the sample versions of \(\mathbf{A1 }\) and \(\mathbf{A2 }\) to obtain the results of Theorem 1. We now provide the following Lemma 2, which shows that the population versions of \(\mathbf{A1 }\) and \(\mathbf{A2 }\) imply their sample counterparts with high probability.

Lemma 2

If \(\mathbf{A1 }\) and \(\mathbf{A2 }\) hold for \(\varvec{I}^0\), and \(M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty , a.s.\), then for any \(\nu >0\) and some fixed positive constants A, B and D,

$$\begin{aligned}&{\mathbb {P}}\left( \varLambda _{\mathrm{min}}(\varvec{I}_{\varvec{S} \varvec{S}}^n)\le C_{\mathrm{min}}-\nu \right) \le 2\,{\exp }\left( -A\frac{\nu ^2n}{M_n^2k^2}+B{\log }k\right) ,\\&{\mathbb {P}}\left( \Vert \varvec{I}_{\varvec{S}^c \varvec{S}}^n(\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\Vert _\infty \ge 1-\frac{\gamma }{2}\right) \le {\exp }\left( -D\frac{n}{M_n^2k^3}+{\log }p+{\log }q\right) . \end{aligned}$$

The proof of Lemma 2 is similar to that in Ravikumar et al. (2010), so we omit it here.
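The two quantities controlled in Lemma 2 can be computed directly from \(\varvec{I}^n\). The helper below is a hypothetical sketch (building on the sample_information helper above): it returns the minimum restricted eigenvalue and the sample incoherence.

```python
import numpy as np

# Hypothetical: the minimum eigenvalue of I^n_{SS} and the sample incoherence
# ||I^n_{S^c S} (I^n_{SS})^{-1}||_infty (maximum absolute row sum).
def lemma2_quantities(I_n, S, Sc):
    I_SS = I_n[np.ix_(S, S)]
    I_ScS = I_n[np.ix_(Sc, S)]
    lam_min = np.linalg.eigvalsh(I_SS).min()
    incoherence = np.abs(I_ScS @ np.linalg.inv(I_SS)).sum(axis=1).max()
    return lam_min, incoherence
```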

Proof of Theorem 1

We now employ Lemmas 1 and 2 to prove Theorem 1. Denote by \({\mathcal {A}}\) the event that the results of Theorem 1 hold, let

$$\begin{aligned} {\mathcal {B}}=\{M_n=\Vert \varvec{X}\otimes \varvec{Y}_{\setminus j}\Vert _\infty <\infty \}, \end{aligned}$$

and

$$\begin{aligned} {\mathcal {C}}=\left\{ \left[ \Vert \varvec{I}_{\varvec{S}^c \varvec{S}}^n(\varvec{I}_{\varvec{S} \varvec{S}}^n)^{-1}\Vert _\infty \le 1-{\gamma }\right] \cap \left[ \varLambda _{\mathrm{min}}(\varvec{I}_{\varvec{S} \varvec{S}}^n)\ge C_{\mathrm{min}}\right] \right\} . \end{aligned}$$

Then putting together all the pieces, we have

$$\begin{aligned}&{\mathbb {P}}({\mathcal {A}}^c)\le {\mathbb {P}}({\mathcal {A}}^c\mid {\mathcal {B}}\cap {\mathcal {C}})+{\mathbb {P}}(({\mathcal {B}}\cap {\mathcal {C}})^c)\\&\quad \le 2\,{\exp }(-C\lambda ^2n)+{\mathbb {P}}({\mathcal {B}}^c)+{\mathbb {P}}({\mathcal {C}}^c)\\&\quad \le 2\,{\exp }(-C\lambda ^2n)+2{\mathbb {P}}({\mathcal {B}}^c)+{\mathbb {P}}({\mathcal {C}}^c\cap {\mathcal {B}})\\&\quad \le 2\,{\exp }(-C\lambda ^2n)+2\,{\exp }(-M_n^\delta )\\&\quad \quad +2\,{\exp } \left( -C\frac{n}{M_n^2k^2}+C{\log }k\right) + {\exp }\left( -C\frac{n}{M_n^2k^3}+{\log }p+{\log }q\right) , \end{aligned}$$

where C is some constant and may be different from term to term. By the choice of n and \(M_n\),

$$\begin{aligned} {\mathbb {P}}({\mathcal {A}}^c)\le C {\exp }(-C\lambda ^2n). \end{aligned}$$

The proof is completed. \(\square \)


Cite this article

Guo, X., Zhang, H. Sparse directed acyclic graphs incorporating the covariates. Stat Papers 61, 2119–2148 (2020). https://doi.org/10.1007/s00362-018-1027-8
