A constrained \(\ell \)1 minimization approach for estimating multiple sparse Gaussian or nonparanormal graphical models
Abstract
Identifying context-specific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging applications. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse undirected graphical models (UGM) from aggregated samples across several contexts. Previous joint-UGM studies have mostly focused on sparse Gaussian graphical models (sGGMs) and can’t identify context-specific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly) to learn multi-UGM via a constrained \(\ell \)1 minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the \(\ell \)1 constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate Nonparanormal data that greatly relaxes the normality assumption that many real-world applications do not follow. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate \(O(\log (Kp)/n_{tot})\). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over state-of-the-art multi-sGGM and single-UGM baselines (SIMULE implementation and the used datasets @https://github.com/QData/SIMULE).
Keywords
Graphical model Multi-task learning Computational biologyList of symbols
- \(\varSigma \)
Covariance matrix
- \(\mu \)
Mean vector in the Gaussian distribution
- \(\varOmega \)
Precision matrix
- \(\mathbf {X}\)
Data sample matrix
- \(X_j\)
A random variable follows the Gaussian Distribution
- \(\varSigma ^{(i)}\)
ith Covariance matrix
- \(\varOmega ^{(i)}\)
ith Precision matrix in a multi-task setting
- \(\varOmega _S\)
Shared pattern among all precision matrices in a multi-task setting
- \(\varOmega ^{(i)}_I\)
Individual part of ith Precision matrix in a multi-task setting
- \(\mathbf {X}^{(i)}\)
ith data sample matrix in a multi-task setting
- \(n_i\)
Number of samples in ith data matrix
- \(n_{tot}\)
Total number of samples in a multi-task setting
- \(\beta \)
A column of \(\varOmega \)
- \(\beta ^{(i)}\)
A column of \(\varOmega ^{(i)}\)
- \(\mathbf {x}\)
A p-dimensional sample
- \(\mathbf {e}_j \)
\( (0,\ldots ,1,\ldots ,0)^T\)
- \(\theta \)
A parameter used in linear programming formulation [Eq. (10)], \([\beta ^{(1)^T}, \ldots , \beta ^{(i)^T},\ldots ,\beta ^{(K)^T},\varepsilon K{(\beta ^s)}^T]^T\)
- \(\mathbf {A}^{(i)}\)
A parameter used in linear programming formulation [Eq. (10)], \([0,\ldots ,0,\varSigma ^{(i)},0,\ldots ,0,\frac{1}{\varepsilon K}\varSigma ^{(i)}]\)
- \(\varvec{b}\)
A parameter used in linear programming formulation [Eq. (10)]. Equals to \(\mathbf {e}_j\)
- Z
A random variable follows the nonparanormal distribution
- \(\mathbf {S}\)
Correlation matrix of Z
- \(\varSigma _{tot}\)
Defined in Sect. 5.1
- \(\varOmega _I\)
Defined in Sect. 5.1
- \(\varOmega _S^{tot}\)
Defined in Sect. 5.1
- \(\varOmega _{tot}\)
Defined in Sect. 5.1
- \(X_{tot}\)
Defined in Sect. 5.1
- \(I_K\)
Defined in Sect. 5.1
- \(\sigma _{ij}\)
An entry of \(\varSigma _{tot}\)
- \(\omega _{ij}\)
An entry of \(\varOmega _{tot}\)
- \(\omega _{j}\)
A column of \(\varOmega _{tot}\)
- \(\hat{\varOmega }_{tot}\)
Estimated \(\varOmega _{tot}\)
- \(\varOmega ^0_{tot}\)
True \(\varOmega _{tot}\)
- \(\omega ^0_j\)
A column of \(\varOmega ^0_{tot}\)
- \(\hat{\varSigma }_{tot}\)
Estimated \(\varSigma _{tot}\)
- \(\varSigma ^0_{tot}\)
True \(\varSigma _{tot}\)
- \(\hat{\omega }_{j}\)
A column of \(\hat{\varOmega }_{tot}\)
- \(\hat{\varOmega }^1_{tot}\)
Solution of Eq. (15)
- \(\lambda _n\)
Hyper-parameter for the sparsity in Eq. (7)
- \(\varepsilon \)
Hyper-parameter balancing shared and individual part in Eq. (6)
- K
Total number of tasks
- p
Total number of features
- C
- q
A constant between 0 and 1 [Eq. (19)]
- \(\eta \)
A constant between 0 and 0.25 in condition (C1) [Eq. (17)]
- \(\gamma ,\delta \)
Two constants in condition (C2) [Eq. (18)]
- M
A constant represents the upper bound in Eq. (19)
- \(s_0(p)\)
A constant represents the sparsity level of \(\varOmega \) in Eq. (19)
- \(\hat{\mathbf {B}}, \hat{\mathbf {B}}_I, \hat{\mathbf {B}}_S\)
Solution of Eq. (9)
- \(\hat{\varvec{b}}^{(i)}_j\)
A column of \(\hat{\mathbf {B}}_I\)
- \(\hat{\varvec{b}}_j^S\)
A column of \(\hat{\mathbf {B}}_I\)
- \(\tau _0\)
A constant in Theorem 4
- \(C_4,C_5\)
Constants in Theorem 3
- \(C_0,C_1\)
Constants in Theorem 4(a)
- \(\theta _0 \)
Equals to \(\max \nolimits _{i,j,k} \hat{\varSigma }_{j,k}^{(i)}\)
- \(C_2,C_3\)
Constants in Theorem 4(b)
- \(\hat{\sigma }_{ij}\)
An entry of \(\hat{\varSigma }_{tot}\)
- \(\sigma ^0_{ij}\)
An entry of \(\varSigma _{tot}^0\)
- \(C_{K1},C_{K2}\)
Two constants used in the proof of Theorem 4(b)
- \(\mathbf {h}_j \)
\( \hat{\omega }_j - \omega ^0_j\)
- \(\mathbf {h}_j^1 \)
\( (\hat{\omega }_{ij}I\{ |\hat{\omega }_{ij}|\ge 2t_n \}; 1\le i \le p)^T - \omega _j^0\)
- \(\mathbf {h}_j^2 \)
\( \mathbf {h}_j - \mathbf {h}_j^1\)
- \(\bar{\mathbf {Y}}_{kij} \)
\( \mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\} - {\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\}\)
- \(\check{\mathbf {Y}}_{kij} \)
\( \mathbf {Y}_{kij} - \bar{\mathbf {Y}}_{kij}\)
- \(b_n \)
\( \max _{i,j}{\mathbb E}|\mathbf {X}_{ki}\mathbf {X}_{kj}|I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\}\)
- \(\mathbf {Y}_{kij} \)
\( {\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj} - {\mathbb E}{\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj}\)
- \(\varPhi ^{(i)} \)
The inverse of ith correlation matrix for the nonparanormal case
1 Introduction
Undirected graphical models (UGMs) provide a powerful tool for understanding statistical relationships among random variables. In a typical setting, we can represent the conditional dependency patterns among p random variables \(\{ X_1,\ldots ,X_p \}\) using an undirected graph \(G=(V,E)\). V includes p nodes corresponding to the p variables. E denotes the set of edges describing conditional dependencies among the variables \(\{X_1, \ldots , X_p\}\). If a pair of random variables is conditionally dependent given the rest of variables, there exists an edge in E connecting the corresponding pair of nodes in V; otherwise, the edge is absent. Within the graphical model framework, the task of estimating such undirected graphs based on a set of observed data samples is called structure estimation or model selection. Much of the related literature has focused on estimating G from a given data matrix \(\mathbf {X}_{n\times p}\) (with n observations across p random variables) that are independently and identically drawn from \(N_p(\mu , \varSigma )\). Here \(N_p(\mu , \varSigma )\) represents a multivariate Gaussian distribution with mean vector \(\mu \) (\(\mu \in \mathbb {R}^p\)) and covariance matrix \(\varSigma \) (with size \(p \times p\)). Using the aforementioned G to describe pairwise dependencies among p variables for such a multivariate Gaussian distribution is called a Gaussian graphical model (GGM, e.g., Lauritzen 1996; Mardia et al. 1980). The inverse of the covariance matrix is called the precision matrix, \(\varOmega := \varSigma ^{-1}\). Interestingly, GGM’s conditional independence pattern corresponds to zeros of \(\varOmega \). This means that an edge does not connect ith node and jth node (i.e., conditionally independent) in GGM if and only if \(\varOmega _{ij} = 0\).
To achieve a consistent estimation of G, assumptions are usually imposed on the structure of \(\varOmega \). Most commonly, the graph sparsity assumption has been introduced by various estimators to derive sparse GGM (sGGM). The graph sparsity assumption corresponds to limiting the number of non-zero entries in the precision matrix \(\varOmega \), which leads to a combinational problem for structure estimation. Many classic GGM estimators use the \(\ell _1\)-norm to create convex relaxation of the combinatorial formulation. For instance, the popular estimator “graphical lasso” (GLasso) has considered maximizing a \(\ell _1\)-penalized log-likelihood objective (Yuan and Lin 2007; Banerjee et al. 2008; Hastie et al. 2009; Rothman et al. 2008). More recently, Cai et al. (2011) proposed a constrained \(\ell _1\)-minimization formulation for estimating \(\varOmega \), known as the CLIME estimator. CLIME can be solved through column-wise linear programming and has shown favorable theoretical properties. Moreover, the nonparanormal graphical models (NGM) recently proposed by Liu et al. (2012) have extended sGGM to new distribution families. Both GGM and NGM belong to the general family of UGM (reviewed in Sect. 4).
This paper focuses on the problem of jointly estimating K undirected graphical models from K related multivariate sample blocks. Each sample block contains a different set of data observations on the same set of variables. This task is motivated by the fact that the past decade has seen a revolution in collecting large-scale heterogeneous data from many scientific fields like genetics and brain science. For instance, genomic technologies have delivered fast and accurate molecular profiling data across many cellular contexts (e.g. cell lines or cell stages) (ENCODE Project Consortium 2011). Many neuroimaging studies have collected functional measurements of brain regions across a cohort of multiple human subjects (Di Martino et al. 2014). Such networks can be concerned with identifying subject-specific variations across a population, where each individual is a unique context. For this type of data, understanding and quantifying context-specific variations across multiple graphs is a fundamental analysis task. Figure 1 provides a simple illustration (with two contexts) of the target problem. Interaction patterns that are activated only under a specific context can help to understand or to predict the importance of such a context (Ideker and Krogan 2012; Kelly et al. 2012).
Prior approaches for estimating UGMs from such heterogeneous data tend to either only estimate pairwise differential patterns between two graphs or jointly estimate multiple sGGMs toward a common graph pattern (reviewed in Sect. 4). The former strategy does not exploit the shared network structure across contexts and is not applicable for more than two contexts, leading to undesirable effects on the quality of the estimated networks. Conversely, the latter approach underestimates the network variability and makes implicit assumptions to minimize inter-context differences which are difficult to justify in practice. This is partly caused by the fact that relevant studies have mostly extended the “graphical lasso” (GLasso) estimator to multi-task settings and followed a penalized log-likelihood formulation [Eq. (2)]. Under the GLasso framework, however, explicitly quantifying the context-specific substructures involves a very challenging optimization task (explained in detail in Sect. 4).
Novel model Using a constrained \(\ell \)1 optimization strategy (Sect. 2), SIMULE extends CLIME to a multi-task setting. The learning step is solved efficiently through a formulation of multiple independent sub-problems of linear programming (Sect. 2.4) for which we also provide a parallel version of the learning algorithm. Compared with previous multi-task sGGM models, SIMULE can accurately quantify task-specific network variations that are unique for each task. This also leads to a better generalization and benefits all the involved tasks.
Novel extension Furthermore, since most real-world datasets do not follow the normality assumption, we extend SIMULE to Nonparanormal SIMULE (NSIMULE in Sect. 3) by learning multiple NGM under the proposed \(\ell \)1 constrained minimization. NSIMULE can deal with non-Gaussian data that follow the nonparanormal distribution (Liu et al. 2009), a much more generic data distribution family. Fitting NSIMULE is computationally as efficient as SIMULE.
Theoretical convergence rate In Sect. 5, we theoretically prove that SIMULE and NSIMULE achieve a consistent estimation of the target (true) dependency graphs with a high probability at the rate \(O(\log (Kp)/n_{tot})\). Here \(n_{tot}\) represents the total number of samples from all tasks and K describes the number of tasks. This proof also theoretically validates the benefit of learning multiple sGGMs jointly (Sect. 7), since the \(O(\log (Kp)/n_{tot})\) convergence rate is better than learning multiple single-sGGMs separately at rate \(O(\log p/n_i)\). \(n_{i}\) represents the number of samples of ith task. Such an analysis hasn’t been provided in any of the previous multi-sGGM studies.
Performance improvement In Sect. 6 we show a strong improved performance of SIMULE and NSIMULE over multiple baseline methods on multiple synthetic datasets and two real-world multi-cell biomedical datasets. The proposed methods obtain better AUC and partial AUC scores on all simulated cases. On two real-world datasets, our methods find the most matches of variable interactions when using existing BioMed-databases for validation.
2 Method: SIMULE
Learning multiple UGM jointly is a task of interests in many applications. This paper tries to model and learn context-specific graph variations explicitly, because such variations can “fingerprint” important markers for fields like cognition (Ideker and Krogan 2012), physiology (Kelly et al. 2012) or pathology (Ideker and Krogan 2012; Kelly et al. 2012). We consider the general case of estimating K graphical models from a p-dimensional aggregated dataset in the form of K different data blocks.
In what follows, plain letters denote scalars. Uppercase and lowercase bold letters denote matrices and vectors respectively.^{1} We denote \(\mathbf {X}^{(i)}_{n_i \times p}\) as the ith data block (or data matrix). The total number of data samples uses notation \(n_{tot} =\sum \nolimits _{i=1}^{K}n_i\). The precision matrix uses notation \(\varOmega \) and the covariance matrix uses notation \(\varSigma \). We denote the correlation matrix as S and the inverse of correlation matrix as \(\varPhi \). The vector \(\mathbf {e}_j = (0,\ldots ,1,\ldots ,0)^T\) denotes a basis vector in which only the jth entry is 1 and the rest are 0. For a p-dimensional data vector \(\mathbf {x}= (x_1, x_2, \ldots , x_p)^T \in \mathbb {R}^p\), let \(||\mathbf {x}||_1 = \sum \nolimits _i|x_i|\) be the \(\ell _1\)-norm of \(\mathbf {x}\), \(||\mathbf {x}||_{\infty } = \max \nolimits _i|x_i|\) be the \(\ell _{\infty }\)-norm of \(\mathbf {x}\) and \(||\mathbf {x}||_2 = \sqrt{\sum \nolimits _i x_i^2}\) be the \(\ell _2\)-norm of\(\mathbf {x}\). Similarly, for a matrix \(\mathbf {X}\), let \(||\mathbf {X}||_1 = \sum \nolimits _{i,j}|\mathbf {X}_{i,j}|\) be the \(\ell _1\)-norm of \(\mathbf {X}\) and \(||\mathbf {X}||_{\infty } = \max \nolimits _{i,j}|\mathbf {X}_{i,j}|\) be the \(\ell _{\infty }\)-norm of \(\mathbf {X}\). \(||\mathbf {X}||_2 = \sqrt{\lambda _{\max }(\mathbf {X})}\), here \(\lambda _{\max }\) is the largest eigenvalue of \(\mathbf {X}\). \(||\mathbf {X}||_F = \sqrt{\sum \nolimits _{i,j}\mathbf {X}_{i,j}^2}\) is the F-norm of \(\mathbf {X}\). \(||\mathbf {X}||_\mathbf{1 } = \max \nolimits _{j} \sum \nolimits _i |\mathbf {X}_{ij}|\) is the matrix \(\mathbf 1 \)-norm of \(\mathbf {X}\). \(||\mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)}||_{1,p} = \sum \nolimits _{i}||\mathbf {X}^{(i)}||_p\) is the \(\ell _{1,p}\)-norm of \((\mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)})\). \(\varOmega \succ 0\) means that \(\varOmega \) is a positive definite matrix. \(||(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)})||_{\mathcal {G},2} = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p ||(\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(i)}, \ldots , \varOmega _{j,k}^{(K)})||_2\). \(||(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)})||_{\mathcal {G},\infty } = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p ||(\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(i)}, \ldots , \varOmega _{j,k}^{(K)})||_\infty \).
2.1 Background: CLIME for estimating sparse Gaussian graphical model
2.2 Background: multi-task learning with task-shared and task-specific parameters
2.3 SIMULE: infer Shared and Individual parts of MULtiple sGGM Explicitly
Treating sparse GGM estimation from each data block as a single task, our main goal is to learn multiple sGGMs over K tasks jointly, which can lead to a better generalization across all of the involved tasks (theoretically proven in Sect. 5).
2.4 Optimization
To solve Eq. (11), we follow the primal dual interior method (Boyd and Vandenberghe 2004) that has also been used in the Dantzig selector for the task of regression (Candes and Tao 2007). Other strategies can be used to solve this linear programming, such as the one used in Pang et al. (2014).
2.5 Parallel version of SIMULE
Algorithm 1 can easily be revised into a parallel version. Essentially we just need to revise the “For loop” of step 8 in Algorithm 1 into, for instance, “column per machine” or “column per core”. Since the calculation of each column is independent from the other columns, this parallel variation will obtain the same solution as Algorithm 1 at a better speed. Section 6 shows the speed improvements of SIMULE in a multi-core setting versus a single-core setting.
3 Method variation: nonparanormal SIMULE (NSIMULE)
Though sGGM is powerful, its normality assumption is commonly violated in real applications. For instance, for the TF ChIP-Seq data analyzed in Sect. 6.6, the histogram of one of its TF variables is clearly not following Gaussian distribution (across samples, shown as the right distribution graph in Fig. 2). After a univariate log-transformation of the same feature, we obtain its distribution histogram as the left graph in Fig. 2. The transformed data samples are approximately normally distributed. This motivates us to adopt a more generalized UGM (recently proposed in Liu et al. 2009) to overcome the limitation of sGGM. This so-called “nonparanormal graphical model” (Liu et al. 2009) assumes that data samples follow a multivariate nonparanormal distribution, which is a strict superset of the Gaussian distribution. We extend SIMULE to the nonparanormal family and name this novel variation NSIMULE. NSIMULE learns to fit multiple NGMs jointly through modeling task-shared and task-specific parameters explicitly.
3.1 Background: nonparanormal graphical model
3.2 Background: estimate \(\mathbf {S}\) through rank-based measures of correlation matrix \(\mathbf {S}_0\)
Since the direct estimation of covariance matrix \(\mathbf {S}\) is difficult in nonparanormal distribution, recent studies have proposed an efficient nonparametric estimator (Liu et al. 2009) for \(\mathbf {S}\). This estimator is derived from the correlation matrix \(\mathbf {S}_0\). Because the covariance matrix \(\mathbf {S}= diag(\mathbf {S}_i) \mathbf {S}_0 diag(\mathbf {S}_i)\), \(\mathbf {S}^{-1} = diag(\mathbf {S}_i)^{-1} \mathbf {S}_0^{-1} diag(\mathbf {S}_i)^{-1}\). Here \(\mathbf {S}_i = \sqrt{Cov(Z_i,Z_i)}\) and \(diag(\mathbf {S}_i) = diag(\mathbf {S}_1,\mathbf {S}_2,\ldots ,\mathbf {S}_p)\). Therefore, the inverse of correlation matrix (\(\mathbf {S}_0^{-1}\)) and the inverse of covariance matrix (\(\mathbf {S}^{-1}\)) have the same nonzero and zero entries. Based on this observation, Liu et al. (2009) proposed a nonparametric method to estimate the correlation matrix \(\mathbf {S}_0\), instead of estimating the covariance matrix \(\mathbf {S}\) for the purpose of structure inference.
In Liu et al. (2009) the authors proposed using the population Kendall’s tau correlation coefficients \(\tau _{jk}\) to estimate \(\mathbf {S}_0\), based upon the explicit relationship between this rank-based measure \(\tau _{jk}\) and the correlation measure \((\mathbf {S}_{jk})_0\) for a given nonparanormal dataset \(Z \sim NPN_{p}(\mu , \mathbf {S},f_{1},\ldots ,f_{p})\) (discussed in Liu et al. 2012). Figure 2 presents the simple relationship between \(Z \sim NPN_{p}(\mu ,S;f_{1},\ldots ,f_{p})\) and its latent \(X \sim N(\mu ,S)\). To simplify notations, we use \(\mathbf {S}\) to represent the correlation matrix for the remainder of this paper.
Theorem 1
Proof
The proof is provided in Liu et al. (2009). \(\square \)
3.3 NSIMULE: nonparanormal SIMULE
We can now substitute each sample covariance matrix \(\hat{\varSigma }^{(i)}\) used in Eq. (9) from each task with its corresponding correlation matrix \(\mathbf {S}^{(i)}\) as estimated above. The rest of the computations are the same as SIMULE. We refer to this whole process as NSIMULE. It estimates multiple different, but related, sparse Nonparametric Graphical Models (sNGM) through shared and task-specific parameter representations.
Theorem 2
If X, Y are two independent random variables and f,g\(:\mathbb {R}\rightarrow \mathbb {R}\) are two measurable functions, then f(X) and g(Y) are also independent.
Through the above theorem, the monotone functions f in \(NPN_{p}\) will not change the conditional dependency among variables. As proved in Liu et al. (2009), the conditional dependency network among the latent Gaussian variables X (in this \(NPN_{p}\)) is the same as the conditional dependency network among the nonparanormal variables \(Z_i\), with a parametric asymptotic convergence rate. Therefore, we can use the estimated correlation matrices \(\mathbf {S}^{(i)}\) for the joint network inference of multiple sNGMs in SIMULE. This is also because we have shown that the inverse of the correlation matrix and the inverse of the covariance matrix share the same nonzero and zero patterns.
4 Related work
4.1 Connecting to past multi-sGGM studies
Sparse GGM is an extremely active topic in the recent literature including notable studies like Wainwright and Jordan (2006) and Banerjee et al. (2008). We can categories single-task sGGM estimators into three groups: (a) penalized likelihood (GLasso), (b) neighborhood approach and (c) CLIME estimator.
We choose the most relevant three studies as our baselines in the experiments: (a) Fused Joint graphical lasso (JGL-fused) (Danaher et al. 2013), (b) Group Joint graphical lasso (JGL-group) (Danaher et al. 2013) and (c) SIMONE (Chiquet et al. 2011). JGL-fused and JGL-group are based on the popular “graphical lasso” estimator (Friedman et al. 2008; Yuan and Lin 2007); [using \(L(\varOmega ) =(\log det(\varOmega ) - <\varSigma , \varOmega >)\) in Eq. (13)]. SIMONE (Chiquet et al. 2011) follows neighborhood-selection based estimator. It can be viewed as using a pseudo-likelihood approximation instead of the full likelihood as \(L(\varOmega )\) in Eq. (13).^{2}
A list of representative multi-sGGM methods and the second penalty functions they have used
References | Penalty function \(P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)}) =\) | |
---|---|---|
(1) | JGL-Fused (Danaher et al. 2013) | \( \sum \nolimits _{ij,i>j}||\varOmega ^{(i)} - \varOmega ^{(j)}||_1\) |
(2) | JGL-Group (Danaher et al. 2013) | \( ||\varOmega ^{(1)},\varOmega ^{(2)}, \ldots , \varOmega ^{(K)}||_{\mathcal {G},2}\) |
(3) | SIMONE (Chiquet et al. 2011) | \(\sum \nolimits _{i \ne j} \left( \left( \sum \nolimits _{k=1}^T\left( \varOmega _{ij}^{(k)}\right) _{+}^{2}\right) \right) ^{\frac{1}{2}}+\left( \left( \sum \nolimits _{k=1}^K\left( -\varOmega _{ij}^{(k)}\right) _{+}^{2}\right) \right) ^{\frac{1}{2}}\) |
(4) | Node JGL (Mohan et al. 2013) | \(\sum \nolimits _{ij,i>j}RCON(\varOmega ^{(i)} - \varOmega ^{(j)})\) |
(5) | JEM-GM (Guo et al. 2011) | \(\sum \nolimits _{k = 1}^K w_k||\varOmega ^{(k)}||_1 \) |
(6) | MTL-GGM (Honorio and Samaras 2010) | \( ||\varOmega ^{(1)},\varOmega ^{(2)}, \ldots , \varOmega ^{(K)}||_{\mathcal {G},\infty }\) |
(7) | CSSL-GGM (Hara and Washio 2013) | \(||\varOmega _S||_1 + ||\varOmega ^{(1)}_I, \varOmega ^{(2)}_I, \ldots , \varOmega ^{(K)}_I||_{1,p}\) |
In addition to these three baselines, a number of recent studies also perform multi-task learning of sGGM (Honorio and Samaras 2010; Guo et al. 2011; Zhang and Wang 2012; Zhang and Schneider 2010; Zhu et al. 2014). They all follow the same formulation as Eq. (13) but explore a different second penalty-function \(P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\). As an example, Node-based JGL proposed a novel penalty, namely RCON (Mohan et al. 2013) (shown as the 4th row of the Table 1) or “row-column overlap norm” for capturing special relationship among graphs. In two recent works, the penalty function at the 5th row is Table 1 has been used by Guo et al. (2011) and the penalty function at the 6th row of Table 1 has been used by Honorio and Samaras (2010).
Furthermore, there exist studies that explore similar motivations as ours when learning multiple GGM models from data. (a) Han et al. (2013) proposed to estimate a population graph from multi-block data using a so-called “median-graph” idea. It is conceptually similar to \(\varOmega _S\). However, they do not have \(\varOmega ^{(i)}_I\) to model individual parts that are specific to each task. (b) Another recent study, CSSL-GGM Hara and Washio (2013) also tried to model both the shared and individual substructures in multi-sGGMs. Different from ours, their formulation is within the penalized likelihood framework as Eq. (13). They used \(\ell \)1,p norm (see last row of Table 1) to regularize the task-specific parts, while SIMULE uses \(\ell \)1 norm instead in Eq. (6). The \(\ell \)1,p norm pushes the individual parts of multiple graphs to be similar which is contradictory to the original purpose of these parameters.^{3} (c) More recently, Monti et al. (2015) proposed to learn population and subject-specific brain connectivity networks via a so-called “Mixed Neighborhood Selection” (MSN) method. Following the neighborhood selection framework (Meinshausen and Bühlmann 2006), for each node v, MSN tried to learn the neighborhood of each v. Similar to SIMULE, they estimated the neighborhood edges of a given node v in the i-task as \(\beta ^v + \widetilde{\varvec{b}}^{(i),v}\). Here \(\beta ^v\) represents the neighbor in the shared part and \(\widetilde{\varvec{b}}^{(i),v}\) represents the neighbors that are specific to the ith graph. Since MSN is specially designed for brain imaging data, it assumes each individual graph is generated by random effects, i.e., \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\). SIMULE does not have such strong assumptions on either task-specific or task-shared substructures. Our model is more general while MSN is designed for brain imaging data. (d) Another line of related studies (Liu et al. 2013; Sugiyama et al. 2013; Fazayeli and Banerjee 2016) prosed density-ratio based strategies to estimate a differential graph between two graphs. Even though this group of methods can handle the unbalance dataset (i.e., the numbers of samples in two datasets are quite different), they can only capture the difference between two graphs (\(K=2\)). SIMULE does not have such a limitation on the number of tasks. (e) Moreover, several loosely related studies exist in settings different from ours. For example, for handling high-dimensional time series data a few recent papers have considered exploring multiple sGGMs by modeling relationships among networks; e.g., Kolar et al. (2010), Qiu et al. (2013).
4.2 Biasing covariance matrices with SIMONE-I and SIMULE-I
The SIMONE package (Chiquet et al. 2011) has introduced “intertwined Lasso” (named as “SIMONE-I” in the rest of paper) that takes the perspective of sharing the information among covariance matrices. More specifically, this variation averages each task’s sample covariance matrix with a global empirical covariance matrix obtained from the whole dataset. Motivated by SIMONE-I, we extend SIMULE with a similar strategy that revises each task’s sample covariance matrix with \(\tilde{ \varSigma }^{(i)} = \alpha \varSigma ^{(i)} + (1-\alpha ) n_{tot}^{-1}\sum _{t=1}^K {n_t\varSigma ^{(t)}} \). Here \(\alpha = 0.5\) . This variation is referred as “SIMULE-I” for the rest of this paper. We report experimental results from both SIMONE-I and SIMULE-I in Sect. 6.
Fan et al. (2014) have pointed out that sample covariance matrix is inconsistent under a high-dimensional setting. There exists a huge body of previous literature for covariance matrices estimation. Roughly relevant studies can be grouped into three types: (a) Sparse covariance matrix estimation, including studies like hard-thresholding (Lam and Fan 2009), soft-thresholding (Tibshirani 1996), smoothly clipped absolution deviation (SCAD, Fan et al. 2013), and minimax concavity penalties (Zhang 2010). These methods, though simple, do not guarantee the positive definiteness of the estimated covariance matrix. (b) Positive definite sparse covariance matrix estimation that introduces penalty functions on the eigenvector space. Popular studies include Antoniadis and Fan (2011), Liu et al. (2014), Levina et al. (2008), Rothman (2012). (c) Factor-model based covariance matrix estimation like POET (Fan et al. 2013) that applies soft-thresholding to the residual space obtained after applying an approximate factor structure on the estimated covariance matrix. Most of these studies can be used to extend SIMULE. We leave a more thorough study of such combinations as future work.
4.3 Penalized log-likelihood for SIMULE
Comparing Eq. (9) of SIMULE with Eq. (13), previous multi-sGGM approaches mostly relied on penalized log-likelihood functions for learning. We have also considered extending penalized log-likelihood method such as the “graphical lasso” estimator into our MTL setting: \(\varOmega ^{(i)} = \varOmega ^{(i)}_I + \varOmega _S\). However, it is hard to deal with the log determinant term \(logdet(\varOmega ^{(i)}_I + \varOmega _S)\) in optimization. This is because \(\frac{\partial logdet(\mathbf {X}+ \mathbf {Y})}{\partial \mathbf {X}} = (\mathbf {X}+ \mathbf {Y})^{-1}\) and it is difficult to calculate the inverse of \(\mathbf {X}+\mathbf {Y}\). There is no closed form for the inverse of the sum of two matrices, except for certain special cases. From the perspective of optimization, it is hard to directly use methods like coordinate descent for learning such a model due to this first derivative issue. The authors of CSSL-GGM (Hara and Washio 2013) handled the issue by using Alternating Direction Method of Multipliers (ADMM) . This optimization is much more complicated than SIMULE. Another package MNS (Monti et al. 2015) tackled this issue partly through adding latent variables. It assumes that each individual part \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\), in which \(\varPhi ^v\) is a latent variable learned by the EM algorithm.
4.4 Relevant studies using \(\ell _1\) optimization with constraints
Two categories of relevant studies: learning with “penalized log-likelihood” or learning with “\(\ell \)1 constrained-optimization”
Tasks | Penalized likelihood | \(\ell \)1 constrained-optimization |
---|---|---|
High dimensional linear regression | Lasso: | Dantzig selector: |
\({\mathop {\hbox {argmin}}\nolimits _{\beta }}||\mathbf {Y}- \beta \mathbf {X}||_F + \lambda ||\beta ||_1\) | \({\mathop {\hbox {argmin}}\nolimits _{\beta }} ||\beta ||_1\) subject to : \(||\mathbf {X}^T\mathbf {y}-\mathbf {X}^T\mathbf {X}\beta ||_{\infty }\le \lambda \) | |
sGGM | GLasso: | CLIME: |
\({\mathop {\hbox {argmin}}\nolimits _{\varOmega \ge 0}} -logdet(\varOmega ) + <\varOmega ,\varSigma > + \lambda ||\varOmega ||_1\) | \({\mathop {\hbox {argmin}}\nolimits _{\varOmega \ge 0}} ||\varOmega ||_1\) subject to: \(||\varOmega \varSigma - I||_{\infty } \le \lambda \) | |
Multi-task learning of sGGM | Different penalty: | Our SIMULE: |
\({\mathop {\hbox {argmin}}\nolimits _{\varOmega ^{(i)} > 0}} \sum \limits _i (-L(\varOmega ^{(i)}) + \lambda _1 \sum \limits _i ||\varOmega ^{(i)}||_1 + \lambda _2 P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)}))\) | \({\mathop {\hbox {argmin}}\nolimits _{\varOmega ^{(i)}_I,\varOmega _S}}\sum \limits _i \left| \left| \varOmega ^{(i)}_I\right| \right| _1+ \varepsilon K||\varOmega _S||\) subject to: \( ||\varSigma ^{(i)}(\varOmega ^{(i)}_I + \varOmega _S) - I||_{\infty } \le \lambda _{n}, \; i = 1,\ldots ,K \) |
4.5 Optimization and computational concerns
Furthermore, a number of papers have focused on proposing ways to improve the performance of computation and data storage when estimating sGGMs. For example, the BigQUIC algorithm (Hsieh et al. 2011) aims at an asymptotic quadratic optimization when estimating sGGM. The authors of the Node-based JGL method (Mohan et al. 2013) proposed a block-separate method to improve the computational performance of multi-sGGMs. Our method can be extended to a parallel setting, since it can be naturally decomposed into per column based optimization. Node-based JGL (Mohan et al. 2013) has also used the ADMM optimization algorithm for learning. Compared to linear-programming-based SIMULE, Mohan et al. (2013) has two main disadvantages: (a) a large number of complex optimizations are required to be solved, and (b) the condition of convergence is unspecified. SIMULE makes use of an efficient algorithm, is proven to achieve good numerical performance, and ensures convergence (see Sect. 5).
Furthermore, since imposing an \(\ell _1\) penalty on the model parameters formulates the structure learning of UGMs as a convex optimization problem, this strategy has shown successful results for modeling continuous data with GGM or NGM and discrete data with the pairwise Markov Random Fields (MRFs) (Friedman et al. 2008; Höfling and Tibshirani 2009). However, the discrete case of pairwise MRF is much harder because of the potentially intractable normalizing constant and also the possibility that each edge may have multiple parameters. One more complicating factor about structure learning is that the pairwise assumption might need a few exceptions, like searching for higher-order combinatorial interactions in recent studies like Schmidt and Murphy (2010), Buchman et al. (2012). (Detailed descriptions of related work are not included due to space limitation.)
5 Theoretical Analysis
5.1 Theoretical Analysis for Basic SIMULE
Lemma 1
Proof
If \(ab<0\), let \(c = a + b\) and \(d = 0\). \(cd = 0 \ge 0\) and \(c+d = a+b\). Therefore, \(|c| + \varepsilon |d| = |a+b|<|a| < |a| + \varepsilon |b|\). This contradicts a, b are the optimal solution of Eq. (14). \(\square \)
Corollary 1
Assume \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7), then
\(||(\varOmega _S + \varOmega _I^{(i)})||_1 = ||\varOmega _S||_1 + ||\varOmega _I^{(i)}||_1\)
Proof
By Lemma 1, we have that \(\widehat{\varOmega }_{I,j,k}^{(i)}\widehat{\varOmega }_{S,j,k} \ge 0\), if \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7). \(\square \)
We use \(\varSigma _{tot}^0\) to represent the true value of \(\varSigma _{tot}\) and \(\widehat{\varSigma }_{tot}\) as the estimated. We also use \(\varOmega ^0_{tot} = (\omega _1^0,\omega _2^0,\ldots ,\omega _{Kp}^0)\) to describe the true \(\varOmega _{tot}\) and \(\widehat{\varOmega }^1_{tot} = (\hat{\omega }^1_{ij})\) to denote the solution of the optimization problem in Eq. (15). The final solution is denoted as \(\widehat{\varOmega }_{tot} := (\hat{\omega }_{ij}) = (\hat{\omega }_1,\hat{\omega }_2,\ldots ,\hat{\omega }_{Kp})\) where \(\hat{\omega }_{ij} = \hat{\omega }_{ji} = \hat{\omega }^{1}_{ij}\mathop {\mathrm {sign}}( \max (|\hat{\omega }^{1}_{ij}|- |\hat{\omega }^{1}_{ji}|,0)) + \hat{\omega }^{1}_{ji} \mathop {\mathrm {sign}}( \max (|\hat{\omega }^{1}_{ji}|- |\hat{\omega }^{1}_{ij}|, 0))\). Furthermore, we denote \({\mathbb E}[X^{(i)}]\) as \((\mu _1^{(i)},\mu _2^{(i)}, \ldots ,\mu _p^{(i)})^T\).
Lemma 2
Lemma 3
If each \(\varOmega _I^{(i)} + \varOmega _S\) satisfies Condition Eq. (19), then \(\varOmega _{tot}\) also satisfies Condition Eq. (19).
Proof
use the definition of \(\varOmega _{tot}\). \(\square \)
Corollary 2
\(\widehat{\varOmega }_{tot}\) satisfies the condition \(\widehat{\varOmega }_{tot} \succ 0\), with a high probability.
Theorem 3
Theorem 4
Suppose that \(\varOmega ^0_{tot} \in \mathcal {U}(q,s_0(p))\) and \(n_{tot}=\sum _{i=1}^K n_i\).
Through Eq. (23)–(29), we theoretically prove that we can achieve a good estimation of target dependency graphs with the convergence rate \(O(\log (Kp)/n_{tot})\). Based on CLIME (Cai et al. 2011), the convergence rate of single-task sGGM is \(O(\log p/n_i)\). Here \(n_{i}\) represents the number of samples of ith task. Assuming \(n_i = \frac{n_{tot}}{K}\), the convergence rate of single sGGM is \(O(K\log p/n_{tot})\). Clearly, since \(K\log p > \log (Kp)\), the convergence rate of SIMULE is better than single-task sGGM. This provides theoretical proofs for the benefit of multi-tasking sGGM. Neither of these theoretical results have been investigated by the previous studies.
All proofs of above theorems are provided in Sect. 7.
5.2 Theoretical analysis for NSIMULE estimator
In this subsection, we investigate the theoretical properties of NSIMULE estimator and prove that its convergence rate is the same as the SIMULE.
Theorem 5
Based on Liu et al. (2009), Suppose we use the estimated Kendall’s tau correlation matrix \(\widehat{\mathbf {S}}\) to replace \(\widehat{\varSigma }\) in the parametric GLasso estimator. Then under the same conditions on \(\widehat{\varSigma }\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal estimator achieves the same (parametric) rate of convergence as GLasso estimator for both the precision matrix estimation and the graph structure recovery.
Theorem 6
We use the estimated Kendall’s tau correlation matrix \(\hat{\mathbf {S}}^{(i)}\) in the Nonparanormal SIMULE estimator. Then under the same conditions on \(\hat{\varSigma }^{(i)}\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal SIMULE achieves the same rate of convergence as SIMULE (\(O(\sqrt{\log (Kp)/n_{tot}})\)) for both the graph recovery and precision matrix estimation.
5.3 Potential non-identifiability issue
Linear programming is not strongly convex. Therefore, there may be multiple ideal solutions in the SIMULE formulation of Eq. (7) (i.e., identifiability problem). In fact, the CLIME (Cai et al. 2011) estimator may also have multiple optimal solutions. Cai et al. (2011) have proved all such solutions converge to the true one at an optimal convergence rate. Similarly, in Sect. 5.1, we have proved that SIMULE formulation in Eq. (7) may result in multiple optimal solutions \(\{ \hat{\varOmega }_{tot} \}\). Each of these solutions \(\hat{\varOmega }_{tot}\) converges to the true solution with an optimal convergence rate. We present Theorem 7 showing that for each optimal solution \(\hat{\varOmega }_{tot}\), when \(\varepsilon \ne 1\), we can obtain unique estimation of \(\varOmega _S\) and \(\{ \varOmega _I^{(i)} | i = 1,\ldots ,K\}\).
Theorem 7
When we pick \(\varepsilon > 0\) and \(\varepsilon \ne 1\), for each optimal solution \(\hat{\varOmega }_{tot}\) from Eq. (7), there exist unique \(\varOmega _S\) and \(\{ \varOmega _I^{(i)} | i = 1,\ldots ,K\}\) satisfying Eq. (5).
We provide the proof of Theorem 7 in Sect. 7.
In practice, we need to decide \(\varepsilon \) according to the application for which SIMULE is used. For example, for the genome-related biomedical data, we can normally assume that the shared subgraph is more dense than individual interactions of each context.^{4} Therefore we pick \(\varepsilon < 1\) to reflect this assumption in our experiments in Sect. 6.
6 Experiments
AUC and partial AUC on simulated Gaussian datasets from Model 1
AUC | AUC-individual | AUC-shared | AUC-FPR \(\le \) 20% | AUC-FPR \(\le \) 5% | AUC- \(p = 200\) | |
---|---|---|---|---|---|---|
Gaussian-model1-[\(K=3\)]-[\(p=100\)] | ||||||
NSIMULE | 0.9872 | 0.8408 | 0.8964 | 0.1599 | 0.0188 | 0.9959 |
SIMULE | 0.9844 | 0.8379 | 0.8788 | 0.1587 | 0.0179 | 0.9945 |
JGL-fused | 0.6843 | 0.5666 | 0.9817 | 0.0989 | 0.0094 | 0.6745 |
JGL-group | 0.5162 | 0.4988 | 0.5759 | 0.0908 | 0.0174 | 0.5122 |
SIMONE | 0.7748 | 0.5124 | 0.9321 | 0.0992 | 0.0171 | 0.5488 |
CLIME | 0.6509 | 0.5197 | 0.7795 | 0.0439 | 0.0001 | 0.5422 |
NCLIME | 0.5400 | 0.4999 | 0.8224 | 0.0434 | 0.0001 | 0.5216 |
SIMONE-I | 0.8041 | 0.6740 | 0.9681 | 0.1030 | 0.0177 | 0.6122 |
SIMULE-I | 0.9979 | 0.8594 | 0.9249 | 0.1604 | 0.0183 | 0.9984 |
AUC and partial AUC on simulated nonparanormal datasets from Model 1
AUC | AUC-individual | AUC-shared | AUC-FPR \(\le \) 20% | AUC-FPR \(\le \) 5% | AUC- \(p = 200\) | |
---|---|---|---|---|---|---|
Nonparanormal-model1-[\(K = 3\)]-[\(p=100\)] | ||||||
NSIMULE | 0.8172 | 0.8408 | 0.8964 | 0.1599 | 0.0188 | 0.8325 |
SIMULE | 0.7322 | 0.8165 | 0.8788 | 0.1567 | 0.0173 | 0.7745 |
JGL-fused | 0.6942 | 0.6362 | 0.9817 | 0.0993 | 0.0124 | 0.7308 |
JGL-fused-nonparanormal | 0.6978 | 0.6374 | 0.9896 | 0.1012 | 0.0137 | 0.7343 |
JGL-group | 0.5181 | 0.4942 | 0.8050 | 0.0487 | 0.0064 | 0.5416 |
JGL-group-nonparanormal | 0.6500 | 0.5563 | 0.8614 | 0.0868 | 0.0113 | 0.5498 |
SIMONE | 0.7198 | 0.5061 | 0.9321 | 0.1080 | 0.0102 | 0.5671 |
SIMONE-nonparanormal | 0.7271 | 0.5072 | 0.9327 | 0.1146 | 0.0135 | 0.5766 |
CLIME | 0.5803 | 0.5108 | 0.7745 | 0.0427 | 0.0001 | 0.5127 |
NCLIME | 0.5298 | 0.4935 | 0.8224 | 0.0398 | 0.0001 | 0.4915 |
6.1 Experimental settings
6.1.1 Baselines
We compare SIMULE, SIMULE-I and NSIMULE with the following baselines: (1) Three different multi-sGGM estimators including JGL-fused, JGL-group (Danaher et al. 2013), and SIMONE (Chiquet et al. 2011) (with the penalty functions described in Table 1); (2) The single-task CLIME baseline (i.e. each task uses CLIME independently); (3) The single-task nonparanormal CLIME (NCLIME) baseline (i.e., each task uses NCLIME independently). (4) The nonparanormal extension of JGL-fused, JGL-group, and SIMONE.^{5} (5) SIMONE-I (“intertwined Lasso” in SIMONE package) is added as a baseline for comparison with SIMULE-I.^{6}
6.1.2 Metric
AUC and partial AUC on simulated Gaussian datasets from Model 2
AUC | AUC-individual | AUC-shared | AUC-FPR \(\le \) 20% | AUC-FPR \(\le \) 5% | AUC- \(p = 200\) | |
---|---|---|---|---|---|---|
Gaussian-model2-[\(K=2\)] | ||||||
NSIMULE | 0.9997 | 0.8095 | 0.9727 | 0.1997 | 0.0497 | 1.0000 |
SIMULE | 0.9996 | 0.8391 | 0.9697 | 0.1997 | 0.0497 | 0.9998 |
JGL-fused | 0.9991 | 0.4893 | 0.9983 | 0.1991 | 0.0491 | 0.9993 |
JGL-group | 0.9999 | 0.5000 | 0.7715 | 0.1999 | 0.0499 | 0.9866 |
SIMONE | 0.9989 | 0.7632 | 0.9982 | 0.1989 | 0.0489 | 0.9990 |
CLIME | 0.5077 | 0.3948 | 0.7517 | 0.0404 | 0.0025 | 0.5037 |
NCLIME | 0.4995 | 0.4043 | 0.7614 | 0.0402 | 0.0025 | 0.5022 |
SIMONE-I | 0.9995 | 0.7692 | 0.9986 | 0.1995 | 0.0499 | 0.9951 |
SIMULE-I | 0.9997 | 0.8704 | 0.9997 | 0.1999 | 0.0499 | 1.0000 |
AUC and partial AUC on simulated nonparanormal datasets from Model 2
AUC | AUC-individual | AUC-shared | AUC-FPR \(\le \) 20% | AUC-FPR \(\le \) 5% | AUC- \(p = 200\) | |
---|---|---|---|---|---|---|
Nonparanormal-model2-[\(K=2\)] | ||||||
NSIMULE | 0.9993 | 0.8095 | 0.9727 | 0.1993 | 0.0493 | 1.0000 |
SIMULE | 0.9993 | 0.8453 | 0.9609 | 0.1929 | 0.0419 | 0.9996 |
JGL-fused | 0.9984 | 0.5117 | 0.9984 | 0.1984 | 0.0424 | 0.9996 |
JGL-fused-nonparanormal | 0.9990 | 0.5641 | 0.9998 | 0.1986 | 0.0425 | 0.9996 |
JGL-group | 0.9784 | 0.6791 | 0.9151 | 0.1967 | 0.0464 | 1.0000 |
JGL-group-nonparanormal | 0.9899 | 0.6948 | 0.9391 | 0.1969 | 0.0465 | 1.0000 |
SIMONE | 0.9991 | 0.7529 | 0.9911 | 0.1990 | 0.0491 | 0.9983 |
SIMONE-nonparanormal | 0.9992 | 0.7960 | 0.9948 | 0.1993 | 0.0492 | 0.9986 |
CLIME | 0.4985 | 0.3740 | 0.7517 | 0.0238 | 0.0001 | 0.5041 |
NCLIME | 0.4994 | 0.4042 | 0.7614 | 0.0203 | 0.0001 | 0.5022 |
6.1.3 Selection of hyper-parameter \(\lambda _{n}\)
Recent research studies from Negahban et al. (2009) and Yang et al. (2014) conclude that the regularization parameter \(\lambda \) of a single-task sGGM (e.g., with \(n_i\) samples) should satisfy \(\lambda \propto \sqrt{\frac{\log p}{n_i}}\). Combining this conclusion with our theoretical analysis in Sect. 5, we choose \(\lambda _{n} = \alpha \sqrt{\frac{\log (Kp)}{n_{tot}}}\) where \(\alpha \) for SIMULE or NSIMULE is a hyper-parameter to tune. In our experiments, \(\alpha \) is varied over a range of \( \{0.05 \times i| i\in \{1,2,3,\ldots ,30 \}\}\). The Bayesian information criterion (BIC) is used for situations requiring to select a specific value of hyper-parameters.
Besides, we also need to tune the hyper-parameters of the baseline methods to obtain their FPR–TPR curves. If only one hyper-parameter needs tuning, we follow the same strategy as SIMULE. For those baselines (JGL-fused and JGL-group) having two hyper-parameters, when given a certain \(\lambda _1\) (the same as \(\lambda _n\)), we use BIC criteria to select its best \(\lambda _2\) from a range of \(\{0.05\times i| i\in \{1,2,3,\ldots ,20 \}\}\).
6.1.4 Selection of hyper-parameter \(\varepsilon \)
\(\varepsilon \) reflects the difference of sparsity in the shared subgraph versus the context-specific subgraphs. Section 5.3 has discussed our choice of \(\varepsilon \) on two real-world datasets. Similarly for the simulated experiments, we select \(\varepsilon \) from a range of \(\{0.1\times i| i\in \{1,2,\ldots ,9\}\}\).
6.2 Simulated Gaussian datasets
Model 1 Coming from Rothman et al. (2008), this model assumes \(\varOmega ^{(i)} = \mathbf {B}^{(i)}_I + \mathbf {B}_S + \delta ^{(i)}I\), where each off-diagonal entry in \(\mathbf {B}^{(i)}_I\) is generated independently and equal to 0.5 with probability 0.05i and 0 with probability \(1 - 0.05i\). The shared part \(\mathbf {B}_S\) is generated independently and equal to 0.5 with probability 0.1 and 0 with probability 0.9. \(\delta ^{(i)}\) is selected large enough to guarantee the positive definiteness of precision matrix. A clear shared structure \(\mathbf {B}_S\) exists among multiple graphs. We choose \(K \in \{2,3,4,5,6\}\) for this case.
Model 2 This model uses two special-structure graphs, i.e., a grid graph and a ring graph. The first task uses a grid graph and the second uses a ring graph. These two special networks are popular in many real-world applications . For instance, certain biological pathways can be represented as rings. A clear shared structure exists between these two graphs. Clearly, \(K = 2\) for this case.
In summary, we first simulate precision matrices by Model 1 or Model 2. We then use multivariate distribution method (Ripley 2009) to sample multivariate Gaussian Distributed data blocks with mean 0 and covariance matrix \((\varOmega ^{(i)})^{-1}\). This stochastic procedure will generate simulated data blocks with the decomposition in Eq. (5). Then we apply SIMULE, NSIMULE and baseline models on these datasets to obtain the estimated dependency networks.
6.3 Simulated nonparanormal datasets
Using the same graphs of Model 1 and Model 2, we simulate two more sets of data samples following the nonparanormal distributions in K different tasks. We pick \(K=3\) for Model 1 and \(K =2\) for Model 2. Starting from \(N(0, (\varPhi ^{(i)})^{-1})\) and transforming with a monotone function \(x : \rightarrow sign(\mathbf {x})|\mathbf {x}|^{\frac{1}{2}}\), data sample is generated as a random vector Z, where \(sign(Z)Z^2 = (sign(\mathbf {z}_1)\mathbf {z}_1^2,\ldots ,sign(\mathbf {z}_p)\mathbf {z}_p^2) \sim N(0, (\varPhi ^{(i)})^{-1})\). Thus Z follows a nonparanormal distribution. Using different monotone functions for generating simulated data will not change the resulting correlation matrix \(\mathbf {S}\) since we use the rank-based nonparametric estimator estimating the correlation matrix.
6.4 Experimental results on synthetic datasets
Figure 3 shows more detailed FPR–TPR comparisons among the joint sGGM estimators. The sub-figure (a) “Gaussian-Model1” clearly shows that our methods obtain better curves than three multi-sGGM baselines. On the sub-figure (b) “Gaussian-Model2”, the differences among multi-sGGM estimators are not as apparent as “Gaussian-Model1”. Figure 3c, d (in the second row) show FPR–TPR curves from two nonparanormal datasets. Overall, the observations from comparing these curves are consistent with those obtained from the four tables (Tables 3, 4, 5, 6).
One important hyper-parameter we need to pick for SIMULE and NSIMULE is \(\varepsilon \) in Eq. (7). This hyper-parameter reflects the sparsity level of the shared subgraph against the context-specific parts. The left two sub-figures (a) and (c) of Fig. 5 show the changes of sparsity level in both individual and shared subgraphs across multiple values of \(\varepsilon \) (by running SIMULE on the Gaussian-Model 1 case). Figure 5a is for \(K=3\) and Fig. 5c is for \(K= 6\). We can see that when \(\varepsilon \) increases, the sparsity level of shared portion decreases while the averaging sparsity of individual parts increases across both cases of K. This matches our analysis in Sect. 5.3. In real applications, \(\varepsilon \) indicates the differences of sparsity constraints we assume on shared and individual parts. It should be chosen according to the domain knowledge of a specific application.
In addition, Fig. 5e tries to investigate whether the changes of \(\varepsilon \) influence the performance (AUC) of (N)SIMULE. On datasets from Model 1, Fig. 5d shows that AUC scores of both SIMULE and NSIMULE exhibit very small variations across a large range of changing \(\varepsilon \).
Section 2.5 presents a parallel variation of SIMULE. On the simulated data of Gaussian-Model 1, we compare the training speed of original-SIMULE versus parallel-SIMULE using Fig. 5b (the right upper sub-figure). For the parallel-SIMULE, we run SIMULE by paralleling “column per core” using 63 cores on a 64-core machine. The baselines, including the original-SIMULE, JGL-fused, JGL-group, SIMONE and SIMONE-I, are run on one restricted core on the same machine.^{8} Figure 5b provides the computational speed (training time in \(\log \)-seconds) across values of dimension p. It clearly shows that the parallel-SIMULE runs much faster than the single-core SIMULE implementation and other baselines.^{9}
Furthermore, Fig. 5d, f provide AUC scores of SIMULE, three multi-SGGM baselines, SIMONE-I and SIMULE-I against the varying p and against the with varying K. SIMULE-I provides a consistent performance improvement over SIMULE. SIMULE outperforms three multi-sGGM baselines across multiple values of p and K. SIMULE-I outperforms SIMONE-I and runs smoothly for larger values of K. Unfortunately Fig. 5f can not provide the AUC scores of SIMONE-I for \(K=5\) and \(K=6\), because the SIMONE package could not converge for some \(\lambda \) values under these two cases of K.
6.5 Experiment results on real application I: identifying gene interaction using gene expression data across two cell contexts
Next, we apply SIMULE and the baselines on one real-world biomedical data: gene expression profiles describing many human samples across multiple cancer types (aggregated by McCall et al. 2011). Recently advancements in genome-wide monitoring have resulted in enormous amounts of data across most of the common cell contexts, like multiple common cancer types (The Cancer Genome Atlas Research Network 2011). Complex diseases such as cancer are the result of multiple genetic and epigenetic factors. Thus, recent research has shifted towards the identification of multiple genes/proteins that interact directly or indirectly in contributing to certain disease(s). Structure learning of UGMs on such heterogeneous datasets can uncover statistical dependencies among genes and understand how such dependencies vary from normal to abnormal or across different diseases. These structural variations are highly likely to be contributing markers that influence or cause the diseases.
6.6 Experiment results on real application II : identifying collaborations among TFs across multiple cell types
In molecular biology, the regulatory proteins that interact with one another to control gene transcription are known as transcription factors (TFs). TF proteins typically perform major cell regulatory functions (e.g., binding on DNA) by working together with other TFs. The collaboration patterns (e.g., conditional independence) among TFs normally vary across different cell contexts (e.g., cell lines). Meanwhile, a certain portion of the TF interactions are preserved across contexts. Understanding the collaboration networks among TFs is the key to understanding cell development, including defects, which lead to different diseases. The ChIP-Seq datasets recently made available by the ENCODE project (ENCODE Project Consortium 2011) provide simultaneous binding measurements of TFs to thousands of gene targets. These measurements provide a “snapshot” of TF binding events across many cell contexts. The task of uncovering the functional dependencies among TFs connects to the task of discovering the statistical dependencies among TFs from their ChIP-Seq measurements.
Recently, two relevant papers (Cheng et al. 2011; Min et al. 2014) have discussed methods to infer co-association networks among TFs using ChIP-Seq data. Their approaches differ from ours as both projects has targeted a single cell type at a time. We select ChIP-Seq data for 27 TFs that are covered by ENCODE (ENCODE Project Consortium 2012) across three major human cell lines including (1) H1-hESC (embryonic stem cells : primary tissue), (2) GM12878 (B-lymphocyte:normal tissue) and (3) K562 (leukemia:cancerous tissue).
We apply SIMULE, SIMULE-I, JGL-group, JGL-fused, SIMONE and SIMONE-I to this multi-cell TF ChIP-Seq dataset. Comparisons of different methods are performed using three major existing protein interaction databases (Prasad et al. 2009; Orchard et al. 2013; Stark et al. 2006). The numbers of matches between TF-TF interactions in databases and those predicted by each method have been plotted as a bar graph shown in Fig. 6b (the right subfigure). The graph shows that SIMULE consistently outperforms JGL-fused, JGL-group and SIMONE on both individual and shared interactions from all three cell types. SIMULE-I performs better than SIMONE-I and SIMULE. We further evaluated the resulting TF interactions using the popular “functional enrichment” analysis with DAVID (Da Wei Huang and Lempicki 2008) and found that SIMULE and SIMULE-I can reveal known functional sets and potentially novel interactions that drive leukemia. This leads us to believe that our approach can be used in a wider range of applications as well.
Many domain-specific studies have applied sGGM on real-world datasets, especially those from molecular biology or brain science. For example, Ma et al. (2007) estimates gene networks using miroarray data in the model species Arabidopsis. Krumsiek et al. (2011) explores GGM to reconstruct pathway reactions from high-throughput metabolomics data. Within brain science, a few studies (Ng et al. 2013; Huang et al. 2010; Monti et al. 2015; Sun et al. 2009) have tried to learn brain connectivity of Alzheimer’s disease through sparse inverse covariance estimation. Due to space limit, we omit detailed inclusions of these studies.
7 Proof of Theorems
In this section, I(S) denotes the indicator function of the set S.
7.1 Lemma 2
Proof
In the following, we just show the case of \(1 \le j \le p\). When \(j > p\), by replacing \(e_{j \mod p}\) in the following proofs, it will still hold. \(||\varSigma _{tot}\hat{\varOmega }_{tot}^1 - I||_{\infty } \le \lambda _n \) is equivalent to \(||\varSigma _{tot}\hat{\omega }_j^1 - e_j||_{\infty } \le \lambda _n\) for \(1 \le j \le p\).
7.2 Theorem 3
Proof
Finally, Inequality (22) can be derived from Inequality (20), Inequality (36) and the inequality relationship \(||\mathbf {A}||^2_{F} \le p||\mathbf {A}||_1||\mathbf {A}||_{\infty }\). \(\square \)
7.3 Theorem 4
Proof
Theorem 4(a)
Proof
Theorem 4(b)
Let \(\bar{\mathbf {Y}}_{kij} = \mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\} - {\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\}\), \(\check{\mathbf {Y}}_{kij} = \mathbf {Y}_{kij} - \bar{\mathbf {Y}}_{kij}\).
Since
7.4 Theorem 7
Proof
8 Conclusions
This paper introduces a novel method SIMULE for learning shared and distinct patterns simultaneously when inferring multiple sGGMs or sNGMs jointly. Through \(\ell \)1-constrained formulation our solution is efficient and can be parallelized. We successfully apply SIMULE on four synthetic datasets and two real-world datasets. We prove the convergence property of SIMULE to be favorable and justify the benefit of multitasking. Future work will extend SIMULE to model more complex relationships among contexts.
Footnotes
- 1.
Following convention, \(\varSigma \), \(\varOmega \), \(\varPhi \), \(\beta \), \(\mu \), \(\theta \) and I are not bold.
- 2.
- 3.
We can not find CSSL-GGM implementation, therefore can not include it as a baseline.
- 4.
This assumes more interactions are preserved across cell contexts, i.e., partly due to concerns of system or evolutionary stability.
- 5.
We extend JGL and SIMONE on nonparanormal distributions by replacing the codes for sample covariance matrix into Kendall’s tau correlation matrix in their R implementations.
- 6.
It is possible to combine nonparanormal and intertwined strategies to extend SIMULE. We leave this as a future work.
- 7.
We revise the R packages of three baselines to extend them to nonparanormal distribution. This is the provide a more fair comparison of these baselines versus NSIMULE.
- 8.
The multi-core setting we use for this time experiment is to reflect the distributed parallel nature of SIMULE. We leave the topic of using the multi-threading to improve SIMULE as future research.
- 9.
When \(p \ge 400\), SIMONE-I takes more than 5 days to train. That’s why the no data points are shown in Fig. 5b for such cases.
- 10.
NSIMULE was tried as well and has achieved the same validation result as SIMULE.
- 11.
We would like to point out that the interactions SIMULE finds represent statistical dependencies between genes that vary across multiple cell types. There exist many possibilities for such interactions, including like physical protein-protein interactions, regulatory gene pairs or signaling relationships. Therefore, we combine multiple existing databases for a joint validation.
Notes
Acknowledgements
This work was supported by the National Science Foundation under NSF CAREER Award No. 1453580. Any Opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.
References
- Antoniadis, A., & Fan, J. (2011). Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455), 939–967.MathSciNetCrossRefMATHGoogle Scholar
- Banerjee, O., El Ghaoui, L., & d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485–516.MathSciNetMATHGoogle Scholar
- Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.CrossRefMATHGoogle Scholar
- Buchman, D., Schmidt, M., Mohamed, S., Poole, D., & de Freitas, N. (2012). On sparse, spectral and other parameterizations of binary probabilistic models. In AISTATS (pp. 173–181)Google Scholar
- Cai, T., Liu, W., & Luo, X. (2011). A constrained 1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607.MathSciNetCrossRefMATHGoogle Scholar
- Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.MathSciNetCrossRefMATHGoogle Scholar
- Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.MathSciNetCrossRefGoogle Scholar
- Cheng, C., Yan, K. K., Hwang, W., Qian, J., Bhardwaj, N., Rozowsky, J., et al. (2011). Construction and analysis of an integrated regulatory network derived from high-throughput sequencing data. PLoS Computational Biology, 7(11), e1002190.CrossRefGoogle Scholar
- Chiquet, J., Grandvalet, Y., & Ambroise, C. (2011). Inferring multiple graphical structures. Statistics and Computing, 21(4), 537–553.MathSciNetCrossRefMATHGoogle Scholar
- Da Wei Huang, B. T. S., & Lempicki, R. A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1), 44–57.CrossRefGoogle Scholar
- Danaher, P., Wang, P., & Witten, D. M. (2013). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 373–397.MathSciNetCrossRefGoogle Scholar
- Di Martino, A., Yan, C. G., Li, Q., Denio, E., Castellanos, F. X., Alaerts, K., et al. (2014). The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry, 19(6), 659–667.CrossRefGoogle Scholar
- ENCODE Project Consortium. (2011). A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology, 9(4), e1001046.Google Scholar
- ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74.Google Scholar
- Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117). ACM.Google Scholar
- Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review,. doi:10.1093/nsr/nwt032.Google Scholar
- Fan, J., Liao, Y., & Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4), 603–680.MathSciNetCrossRefGoogle Scholar
- Fazayeli, F., & Banerjee, A. (2016). Generalized direct change estimation in ising model structure. arXiv preprint arXiv:1606.05302.
- Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.CrossRefMATHGoogle Scholar
- Guo, J., Levina, E., Michailidis, G., & Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika,. doi:10.1093/biomet/asq060.MathSciNetMATHGoogle Scholar
- Han, F., Liu, H., & Caffo, B. (2013). Sparse median graphs estimation in a high dimensional semiparametric model. arXiv preprint arXiv:1310.3223.
- Hara, S., & Washio, T. (2013). Learning a common substructure of multiple graphical Gaussian models. Neural Networks, 38, 23–38.CrossRefMATHGoogle Scholar
- Hastie, T., Tibshirani, R., Friedman, J., Hastie, T., Friedman, J., & Tibshirani, R. (2009). The elements of statistical learning. Berlin: Springer.CrossRefMATHGoogle Scholar
- Höfling, H., & Tibshirani, R. (2009). Estimation of sparse binary pairwise markov networks using pseudo-likelihoods. The Journal of Machine Learning Research, 10, 883–906.MathSciNetMATHGoogle Scholar
- Honorio, J., & Samaras, D. (2010). Multi-task learning of Gaussian graphical models. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 447–454).Google Scholar
- Hsieh, C. J., Sustik, M. A., Dhillon, I. S., & Ravikumar, P. D. (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS (pp. 2330–2338).Google Scholar
- Huang, S., Li, J., Sun, L., Ye, J., Fleisher, A., Wu, T., et al. (2010). Learning brain connectivity of alzheimer’s disease by sparse inverse covariance estimation. NeuroImage, 50(3), 935–949.CrossRefGoogle Scholar
- Ideker, T., & Krogan, N. J. (2012). Differential network biology. Molecular Systems Biology, 8(1), 565.Google Scholar
- Kelly, C., Biswal, B. B., Craddock, R. C., Castellanos, F. X., & Milham, M. P. (2012). Characterizing variation in the functional connectome: Promise and pitfalls. Trends in Cognitive Sciences, 16(3), 181–188.CrossRefGoogle Scholar
- Kolar, M., Song, L., Ahmed, A., Xing, E. P., et al. (2010). Estimating time-varying networks. The Annals of Applied Statistics, 4(1), 94–123.MathSciNetCrossRefMATHGoogle Scholar
- Krumsiek, J., Suhre, K., Illig, T., Adamski, J., & Theis, F. J. (2011). Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Systems Biology, 5(1), 21.CrossRefGoogle Scholar
- Lam, C., & Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B), 4254.MathSciNetCrossRefMATHGoogle Scholar
- Lauritzen, S. L. (1996). Graphical models. Oxford: Oxford University Press.MATHGoogle Scholar
- Levina, E., Rothman, A., Zhu, J., et al. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. The Annals of Applied Statistics, 2(1), 245–263.MathSciNetCrossRefMATHGoogle Scholar
- Liu, H., Han, F., & Zhang, C. (2012). Transelliptical graphical models. In Advances in Neural Information Processing Systems (pp. 809–817).Google Scholar
- Liu, H., Lafferty, J., & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research, 10, 2295–2328.MathSciNetMATHGoogle Scholar
- Liu, H., Wang, L., & Zhao, T. (2014). Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics, 23(2), 439–459.MathSciNetCrossRefGoogle Scholar
- Liu, S., Quinn, J. A., Gutmann, M. U., & Sugiyama, M. (2013). Direct learning of sparse changes in Markov networks by density ratio estimation. In Joint European conference on machine learning and knowledge discovery in databases (pp. 596–611). Springer.Google Scholar
- Ma, S., Gong, Q., & Bohnert, H. J. (2007). An arabidopsis gene network based on the graphical Gaussian model. Genome Research, 17(11), 1614–1625.CrossRefGoogle Scholar
- Mardia, K. V., Kent, J. T., & Bibby, J. M. (1980). Multivariate analysis. London: Academic Press.MATHGoogle Scholar
- McCall, M. N., Uppal, K., Jaffee, H. A., Zilliox, M. J., & Irizarry, R. A. (2011). The gene expression barcode: Leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Research, 39(suppl 1), D1011–D1015.CrossRefGoogle Scholar
- Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3), 1436–1462.MathSciNetCrossRefMATHGoogle Scholar
- Min, M. R., Ning, X., Cheng, C., & Gerstein, M. (2014). Interpretable sparse high-order Boltzmann machines. In Proceedings of the seventeenth international conference on artificial intelligence and statistics (pp. 614–622).Google Scholar
- Mohan, K., London, P., Fazel, M., Lee, S. I., & Witten, D. (2013). Node-based learning of multiple Gaussian graphical models. arXiv preprint arXiv:1303.5145.
- Monti, R. P., Anagnostopoulos, C., & Montana, G. (2015). Learning population and subject-specific brain connectivity networks via mixed neighborhood selection. arXiv preprint arXiv:1512.01947.
- Negahban, S., Yu, B., Wainwright, M. J., & Ravikumar, P. K. (2009). A unified framework for high-dimensional analysis of \( m \)-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems (pp. 1348–1356).Google Scholar
- Ng, B., Varoquaux, G., Poline, J. B., & Thirion, B. (2013). A novel sparse group Gaussian graphical model for functional connectivity estimation. In Information processing in medical imaging (pp. 256–267). Springer.Google Scholar
- Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., Broackes-Carter, F., et al. (2013). The MIntAct project IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research,. doi:10.1093/nar/gkt1115.Google Scholar
- Pang, H., Liu, H., & Vanderbei, R. (2014). The fastclime package for linear programming and large-scale precision matrix estimation in R. Journal of Machine Learning Research, 15, 489–493.MATHGoogle Scholar
- Prasad, T. K., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., et al. (2009). Human protein reference database 2009 update. Nucleic Acids Research, 37(suppl 1), D767–D772.CrossRefGoogle Scholar
- Qiu, H., Han, F., Liu, H., & Caffo, B. (2013). Joint estimation of multiple graphical models from high dimensional time series. arXiv preprint arXiv: 1311.0219.
- Ripley, B. D. (2009). Stochastic simulation (Vol. 316). London: Wiley.MATHGoogle Scholar
- Rothman, A. J. (2012). Positive definite estimators of large covariance matrices. Biometrika, 99(3), 733–740.MathSciNetCrossRefMATHGoogle Scholar
- Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J., et al. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494–515.MathSciNetCrossRefMATHGoogle Scholar
- Schmidt, M., & Murphy, K. (2010). Convex structure learning in log-linear models: Beyond pairwise potentials. In Proceedings of the international conference on artificial intelligence and statistics (AISTATS).Google Scholar
- Stark, C., Breitkreutz, B. J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M. (2006). Biogrid: A general repository for interaction datasets. Nucleic Acids Research, 34(suppl 1), D535–D539.CrossRefGoogle Scholar
- Sugiyama, M., Kanamori, T., Suzuki, T., du Plessis, M. C., Liu, S., & Takeuchi, I. (2013). Density-difference estimation. Neural Computation, 25(10), 2734–2775.MathSciNetCrossRefGoogle Scholar
- Sun, L., Patel, R., Liu, J., Chen, K., Wu, T., Li, J., et al. (2009). Mining brain region connectivity for Alzheimer’s disease study via sparse inverse covariance estimation. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1335–1344). ACM.Google Scholar
- The Cancer Genome Atlas Research Network. (2011). Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353), 609–615.Google Scholar
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.Google Scholar
- Wainwright, M. J., & Jordan, M. I. (2006). Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6), 2099–2109.CrossRefGoogle Scholar
- Yang, E., Lozano, A. C., & Ravikumar, P. K. (2014). Elementary estimators for graphical models. In Advances in neural information processing systems (pp. 2159–2167).Google Scholar
- Yuan, M., & Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19–35.MathSciNetCrossRefMATHGoogle Scholar
- Zhang, B., & Wang, Y. (2012). Learning structural changes of Gaussian graphical models in controlled experiments. arXiv preprint arXiv:1203.3532.
- Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2), 894–942.MathSciNetCrossRefMATHGoogle Scholar
- Zhang, Y., & Schneider, J. G. (2010). Learning multiple tasks with a sparse matrix-normal penalty. In Advances in neural information processing systems (pp. 2550–2558).Google Scholar
- Zhu, Y., Shen, X., & Pan, W. (2014). Structural pursuit over multiple undirected graphs. Journal of the American Statistical Association, 109(508), 1683–1696.MathSciNetCrossRefMATHGoogle Scholar
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.