A constrained \(\ell _1\) minimization approach for estimating multiple sparse Gaussian or nonparanormal graphical models
Abstract
Identifying context-specific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging applications. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse undirected graphical models (UGM) from aggregated samples across several contexts. Previous joint-UGM studies have mostly focused on sparse Gaussian graphical models (sGGMs) and cannot identify context-specific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly), to learn multi-UGM via a constrained \(\ell _1\) minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the \(\ell _1\) constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate nonparanormal data, which greatly relaxes the normality assumption that many real-world applications do not follow. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate \(O(\log (Kp)/n_{tot})\). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over state-of-the-art multi-sGGM and single-UGM baselines (the SIMULE implementation and the datasets used are available at https://github.com/QData/SIMULE).
Keywords
Graphical model · Multi-task learning · Computational biology
List of symbols
 \(\varSigma \)
Covariance matrix
 \(\mu \)
Mean vector in the Gaussian distribution
 \(\varOmega \)
Precision matrix
 \(\mathbf {X}\)
Data sample matrix
 \(X_j\)
A random variable that follows the Gaussian distribution
 \(\varSigma ^{(i)}\)
ith Covariance matrix
 \(\varOmega ^{(i)}\)
ith Precision matrix in a multitask setting
 \(\varOmega _S\)
Shared pattern among all precision matrices in a multitask setting
 \(\varOmega ^{(i)}_I\)
Individual part of ith Precision matrix in a multitask setting
 \(\mathbf {X}^{(i)}\)
ith data sample matrix in a multitask setting
 \(n_i\)
Number of samples in ith data matrix
 \(n_{tot}\)
Total number of samples in a multitask setting
 \(\beta \)
A column of \(\varOmega \)
 \(\beta ^{(i)}\)
A column of \(\varOmega ^{(i)}\)
 \(\mathbf {x}\)
A p-dimensional sample
 \(\mathbf {e}_j \)
\( (0,\ldots ,1,\ldots ,0)^T\)
 \(\theta \)
A parameter used in linear programming formulation [Eq. (10)], \([\beta ^{(1)^T}, \ldots , \beta ^{(i)^T},\ldots ,\beta ^{(K)^T},\varepsilon K{(\beta ^s)}^T]^T\)
 \(\mathbf {A}^{(i)}\)
A parameter used in linear programming formulation [Eq. (10)], \([0,\ldots ,0,\varSigma ^{(i)},0,\ldots ,0,\frac{1}{\varepsilon K}\varSigma ^{(i)}]\)
 \(\varvec{b}\)
A parameter used in linear programming formulation [Eq. (10)]. Equal to \(\mathbf {e}_j\)
 Z
A random variable that follows the nonparanormal distribution
 \(\mathbf {S}\)
Correlation matrix of Z
 \(\varSigma _{tot}\)
Defined in Sect. 5.1
 \(\varOmega _I\)
Defined in Sect. 5.1
 \(\varOmega _S^{tot}\)
Defined in Sect. 5.1
 \(\varOmega _{tot}\)
Defined in Sect. 5.1
 \(X_{tot}\)
Defined in Sect. 5.1
 \(I_K\)
Defined in Sect. 5.1
 \(\sigma _{ij}\)
An entry of \(\varSigma _{tot}\)
 \(\omega _{ij}\)
An entry of \(\varOmega _{tot}\)
 \(\omega _{j}\)
A column of \(\varOmega _{tot}\)
 \(\hat{\varOmega }_{tot}\)
Estimated \(\varOmega _{tot}\)
 \(\varOmega ^0_{tot}\)
True \(\varOmega _{tot}\)
 \(\omega ^0_j\)
A column of \(\varOmega ^0_{tot}\)
 \(\hat{\varSigma }_{tot}\)
Estimated \(\varSigma _{tot}\)
 \(\varSigma ^0_{tot}\)
True \(\varSigma _{tot}\)
 \(\hat{\omega }_{j}\)
A column of \(\hat{\varOmega }_{tot}\)
 \(\hat{\varOmega }^1_{tot}\)
Solution of Eq. (15)
 \(\lambda _n\)
Hyperparameter for the sparsity in Eq. (7)
 \(\varepsilon \)
Hyperparameter balancing shared and individual part in Eq. (6)
 K
Total number of tasks
 p
Total number of features
 C
 q
A constant between 0 and 1 [Eq. (19)]
 \(\eta \)
A constant between 0 and 0.25 in condition (C1) [Eq. (17)]
 \(\gamma ,\delta \)
Two constants in condition (C2) [Eq. (18)]
 M
A constant representing the upper bound in Eq. (19)
 \(s_0(p)\)
A constant representing the sparsity level of \(\varOmega \) in Eq. (19)
 \(\hat{\mathbf {B}}, \hat{\mathbf {B}}_I, \hat{\mathbf {B}}_S\)
Solution of Eq. (9)
 \(\hat{\varvec{b}}^{(i)}_j\)
A column of \(\hat{\mathbf {B}}_I\)
 \(\hat{\varvec{b}}_j^S\)
A column of \(\hat{\mathbf {B}}_S\)
 \(\tau _0\)
A constant in Theorem 4
 \(C_4,C_5\)
Constants in Theorem 3
 \(C_0,C_1\)
Constants in Theorem 4(a)
 \(\theta _0 \)
Equal to \(\max \nolimits _{i,j,k} \hat{\varSigma }_{j,k}^{(i)}\)
 \(C_2,C_3\)
Constants in Theorem 4(b)
 \(\hat{\sigma }_{ij}\)
An entry of \(\hat{\varSigma }_{tot}\)
 \(\sigma ^0_{ij}\)
An entry of \(\varSigma _{tot}^0\)
 \(C_{K1},C_{K2}\)
Two constants used in the proof of Theorem 4(b)
 \(\mathbf {h}_j \)
\( \hat{\omega }_j - \omega ^0_j\)
 \(\mathbf {h}_j^1 \)
\( (\hat{\omega }_{ij}I\{ |\hat{\omega }_{ij}|\ge 2t_n \}; 1\le i \le p)^T - \omega _j^0\)
 \(\mathbf {h}_j^2 \)
\( \mathbf {h}_j - \mathbf {h}_j^1\)
 \(\bar{\mathbf {Y}}_{kij} \)
\( \mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\} - {\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\}\)
 \(\check{\mathbf {Y}}_{kij} \)
\( \mathbf {Y}_{kij} - \bar{\mathbf {Y}}_{kij}\)
 \(b_n \)
\( \max _{i,j}{\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\}\)
 \(\mathbf {Y}_{kij} \)
\( {\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj} - {\mathbb E}{\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj}\)
 \(\varPhi ^{(i)} \)
The inverse of ith correlation matrix for the nonparanormal case
1 Introduction
Undirected graphical models (UGMs) provide a powerful tool for understanding statistical relationships among random variables. In a typical setting, we can represent the conditional dependency patterns among p random variables \(\{ X_1,\ldots ,X_p \}\) using an undirected graph \(G=(V,E)\). V includes p nodes corresponding to the p variables. E denotes the set of edges describing conditional dependencies among the variables \(\{X_1, \ldots , X_p\}\). If a pair of random variables is conditionally dependent given the rest of the variables, there exists an edge in E connecting the corresponding pair of nodes in V; otherwise, the edge is absent. Within the graphical model framework, the task of estimating such undirected graphs based on a set of observed data samples is called structure estimation or model selection. Much of the related literature has focused on estimating G from a given data matrix \(\mathbf {X}_{n\times p}\) (with n observations across p random variables) whose rows are drawn independently and identically from \(N_p(\mu , \varSigma )\). Here \(N_p(\mu , \varSigma )\) represents a multivariate Gaussian distribution with mean vector \(\mu \) (\(\mu \in \mathbb {R}^p\)) and covariance matrix \(\varSigma \) (with size \(p \times p\)). Using the aforementioned G to describe pairwise dependencies among p variables for such a multivariate Gaussian distribution is called a Gaussian graphical model (GGM, e.g., Lauritzen 1996; Mardia et al. 1980). The inverse of the covariance matrix is called the precision matrix, \(\varOmega := \varSigma ^{-1}\). Interestingly, the conditional independence pattern of a GGM corresponds to the zeros of \(\varOmega \). This means that no edge connects the ith node and the jth node (i.e., the two variables are conditionally independent) in a GGM if and only if \(\varOmega _{ij} = 0\).
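The correspondence between zeros of \(\varOmega \) and missing edges can be illustrated with a small numerical sketch. The three-variable chain below is a hypothetical toy example, and the helper name `edges_from_precision` is an assumption introduced only for illustration.

```python
import numpy as np

# Hypothetical 3-variable chain X1 - X2 - X3: the precision matrix has a zero
# in position (1, 3), so X1 and X3 are conditionally independent given X2.
omega = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
sigma = np.linalg.inv(omega)  # covariance implied by this precision matrix


def edges_from_precision(prec, tol=1e-8):
    """Return the edges (i, j), i < j, whose precision entry is nonzero."""
    p = prec.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(prec[i, j]) > tol]


# The covariance is dense: X1 and X3 are marginally correlated ...
marginally_dependent = abs(sigma[0, 2]) > 1e-8
# ... but inverting it recovers the sparse conditional-dependence graph.
recovered_edges = edges_from_precision(np.linalg.inv(sigma))
```

The covariance matrix alone does not reveal the graph; only its inverse does, which is why structure estimation targets \(\varOmega \) rather than \(\varSigma \).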
To achieve a consistent estimation of G, assumptions are usually imposed on the structure of \(\varOmega \). Most commonly, the graph sparsity assumption has been introduced by various estimators to derive sparse GGM (sGGM). The graph sparsity assumption corresponds to limiting the number of nonzero entries in the precision matrix \(\varOmega \), which leads to a combinatorial problem for structure estimation. Many classic GGM estimators use the \(\ell _1\)-norm to create a convex relaxation of the combinatorial formulation. For instance, the popular estimator “graphical lasso” (GLasso) maximizes an \(\ell _1\)-penalized log-likelihood objective (Yuan and Lin 2007; Banerjee et al. 2008; Hastie et al. 2009; Rothman et al. 2008). More recently, Cai et al. (2011) proposed a constrained \(\ell _1\)-minimization formulation for estimating \(\varOmega \), known as the CLIME estimator. CLIME can be solved through column-wise linear programming and has shown favorable theoretical properties. Moreover, the nonparanormal graphical models (NGM) recently proposed by Liu et al. (2012) have extended sGGM to new distribution families. Both GGM and NGM belong to the general family of UGM (reviewed in Sect. 4).
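The column-wise linear programming view of CLIME can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-column problem \(\min \Vert \beta \Vert _1\) subject to \(\Vert \varSigma \beta - \mathbf {e}_j\Vert _{\infty } \le \lambda \), uses SciPy's generic `linprog` solver in place of a specialized method, and the function names `clime_column` and `clime` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def clime_column(sigma, j, lam):
    """Solve min ||beta||_1 s.t. ||sigma @ beta - e_j||_inf <= lam as an LP."""
    p = sigma.shape[0]
    e_j = np.zeros(p)
    e_j[j] = 1.0
    # Split beta = u - v with u, v >= 0 so that ||beta||_1 = sum(u) + sum(v).
    c = np.ones(2 * p)
    a_ub = np.vstack([np.hstack([sigma, -sigma]),    #  sigma @ beta - e_j <= lam
                      np.hstack([-sigma, sigma])])   # -sigma @ beta + e_j <= lam
    b_ub = np.concatenate([lam + e_j, lam - e_j])
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]


def clime(sigma, lam):
    """Solve every column, then symmetrize by keeping the smaller-magnitude entry."""
    p = sigma.shape[0]
    b = np.column_stack([clime_column(sigma, j, lam) for j in range(p)])
    return np.where(np.abs(b) <= np.abs(b.T), b, b.T)


# Toy covariance whose true precision matrix is a 3-node chain.
sigma = np.linalg.inv(np.array([[2.0, -1.0, 0.0],
                                [-1.0, 2.0, -1.0],
                                [0.0, -1.0, 2.0]]))
est = clime(sigma, lam=0.05)
```

Each column is solved without reference to the others, which is the property that later makes a parallel implementation straightforward.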
This paper focuses on the problem of jointly estimating K undirected graphical models from K related multivariate sample blocks. Each sample block contains a different set of data observations on the same set of variables. This task is motivated by the fact that the past decade has seen a revolution in collecting large-scale heterogeneous data from many scientific fields like genetics and brain science. For instance, genomic technologies have delivered fast and accurate molecular profiling data across many cellular contexts (e.g. cell lines or cell stages) (ENCODE Project Consortium 2011). Many neuroimaging studies have collected functional measurements of brain regions across a cohort of multiple human subjects (Di Martino et al. 2014). Such analyses are often concerned with identifying subject-specific variations across a population, where each individual is a unique context. For this type of data, understanding and quantifying context-specific variations across multiple graphs is a fundamental analysis task. Figure 1 provides a simple illustration (with two contexts) of the target problem. Interaction patterns that are activated only under a specific context can help to understand or to predict the importance of such a context (Ideker and Krogan 2012; Kelly et al. 2012).
Prior approaches for estimating UGMs from such heterogeneous data tend to either only estimate pairwise differential patterns between two graphs or jointly estimate multiple sGGMs toward a common graph pattern (reviewed in Sect. 4). The former strategy does not exploit the shared network structure across contexts and is not applicable for more than two contexts, leading to undesirable effects on the quality of the estimated networks. Conversely, the latter approach underestimates the network variability and makes implicit assumptions to minimize inter-context differences which are difficult to justify in practice. This is partly caused by the fact that relevant studies have mostly extended the “graphical lasso” (GLasso) estimator to multi-task settings and followed a penalized log-likelihood formulation [Eq. (2)]. Under the GLasso framework, however, explicitly quantifying the context-specific substructures involves a very challenging optimization task (explained in detail in Sect. 4).

Novel model: Using a constrained \(\ell _1\) optimization strategy (Sect. 2), SIMULE extends CLIME to a multi-task setting. The learning step is solved efficiently through a formulation of multiple independent subproblems of linear programming (Sect. 2.4) for which we also provide a parallel version of the learning algorithm. Compared with previous multi-task sGGM models, SIMULE can accurately quantify task-specific network variations that are unique for each task. This also leads to a better generalization and benefits all the involved tasks.

Novel extension: Furthermore, since most real-world datasets do not follow the normality assumption, we extend SIMULE to Nonparanormal SIMULE (NSIMULE in Sect. 3) by learning multiple NGM under the proposed \(\ell _1\) constrained minimization. NSIMULE can deal with non-Gaussian data that follow the nonparanormal distribution (Liu et al. 2009), a much more generic data distribution family. Fitting NSIMULE is computationally as efficient as SIMULE.

Theoretical convergence rate: In Sect. 5, we theoretically prove that SIMULE and NSIMULE achieve a consistent estimation of the target (true) dependency graphs with high probability at the rate \(O(\log (Kp)/n_{tot})\). Here \(n_{tot}\) represents the total number of samples from all tasks and K is the number of tasks. This proof also theoretically validates the benefit of learning multiple sGGMs jointly (Sect. 7), since the \(O(\log (Kp)/n_{tot})\) convergence rate is better than the rate \(O(\log p/n_i)\) obtained by learning multiple single sGGMs separately, where \(n_{i}\) represents the number of samples of the ith task. Such an analysis has not been provided in any of the previous multi-sGGM studies.

Performance improvement: In Sect. 6 we show strongly improved performance of SIMULE and NSIMULE over multiple baseline methods on multiple synthetic datasets and two real-world multi-cell biomedical datasets. The proposed methods obtain better AUC and partial AUC scores on all simulated cases. On two real-world datasets, our methods find the most matches of variable interactions when using existing biomedical databases for validation.
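To make the comparison between the joint rate \(O(\log (Kp)/n_{tot})\) and the separate rate \(O(\log p/n_i)\) concrete, the following sketch evaluates both expressions for one hypothetical configuration. The values of K, p and \(n_i\) are assumptions chosen for illustration, and the constants hidden by the big-O notation are ignored, so only the relative order of the two numbers is meaningful.

```python
import math


def joint_rate(k, p, n_tot):
    """log(Kp) / n_tot, the joint-estimation rate up to constants."""
    return math.log(k * p) / n_tot


def single_rate(p, n_i):
    """log(p) / n_i, the single-task rate up to constants."""
    return math.log(p) / n_i


# Hypothetical setting: 3 tasks, 100 variables, 200 samples per task.
k, p, n_i = 3, 100, 200
joint = joint_rate(k, p, k * n_i)
single = single_rate(p, n_i)
```

With equal sample sizes, \(n_{tot} = K n_i\), so the joint rate is smaller whenever \(\log (Kp) < K \log p\), which holds for any \(K \ge 2\) and \(p \ge 2\).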
2 Method: SIMULE
Learning multiple UGM jointly is a task of interest in many applications. This paper tries to model and learn context-specific graph variations explicitly, because such variations can “fingerprint” important markers for fields like cognition (Ideker and Krogan 2012), physiology (Kelly et al. 2012) or pathology (Ideker and Krogan 2012; Kelly et al. 2012). We consider the general case of estimating K graphical models from a p-dimensional aggregated dataset in the form of K different data blocks.
In what follows, plain letters denote scalars. Uppercase and lowercase bold letters denote matrices and vectors respectively.^{1} We denote \(\mathbf {X}^{(i)}_{n_i \times p}\) as the ith data block (or data matrix). The total number of data samples is denoted \(n_{tot} =\sum \nolimits _{i=1}^{K}n_i\). The precision matrix is denoted \(\varOmega \) and the covariance matrix \(\varSigma \). We denote the correlation matrix as \(\mathbf {S}\) and the inverse of the correlation matrix as \(\varPhi \). The vector \(\mathbf {e}_j = (0,\ldots ,1,\ldots ,0)^T\) denotes a basis vector in which only the jth entry is 1 and the rest are 0. For a p-dimensional data vector \(\mathbf {x}= (x_1, x_2, \ldots , x_p)^T \in \mathbb {R}^p\), let \(\Vert \mathbf {x}\Vert _1 = \sum \nolimits _i|x_i|\) be the \(\ell _1\)-norm of \(\mathbf {x}\), \(\Vert \mathbf {x}\Vert _{\infty } = \max \nolimits _i|x_i|\) be the \(\ell _{\infty }\)-norm of \(\mathbf {x}\) and \(\Vert \mathbf {x}\Vert _2 = \sqrt{\sum \nolimits _i x_i^2}\) be the \(\ell _2\)-norm of \(\mathbf {x}\). Similarly, for a matrix \(\mathbf {X}\), let \(\Vert \mathbf {X}\Vert _1 = \sum \nolimits _{i,j}|\mathbf {X}_{i,j}|\) be the \(\ell _1\)-norm of \(\mathbf {X}\) and \(\Vert \mathbf {X}\Vert _{\infty } = \max \nolimits _{i,j}|\mathbf {X}_{i,j}|\) be the \(\ell _{\infty }\)-norm of \(\mathbf {X}\). \(\Vert \mathbf {X}\Vert _2 = \sqrt{\lambda _{\max }(\mathbf {X})}\), where \(\lambda _{\max }\) is the largest eigenvalue of \(\mathbf {X}\). \(\Vert \mathbf {X}\Vert _F = \sqrt{\sum \nolimits _{i,j}\mathbf {X}_{i,j}^2}\) is the F-norm of \(\mathbf {X}\). \(\Vert \mathbf {X}\Vert _\mathbf{1 } = \max \nolimits _{j} \sum \nolimits _i |\mathbf {X}_{ij}|\) is the matrix \(\mathbf 1 \)-norm of \(\mathbf {X}\). \(\Vert \mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)}\Vert _{1,p} = \sum \nolimits _{i}\Vert \mathbf {X}^{(i)}\Vert _p\) is the \(\ell _{1,p}\)-norm of \((\mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)})\). \(\varOmega \succ 0\) means that \(\varOmega \) is a positive definite matrix.
\(\Vert \varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)}\Vert _{\mathcal {G},2} = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p \Vert (\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(i)}, \ldots , \varOmega _{j,k}^{(K)})\Vert _2\). \(\Vert \varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)}\Vert _{\mathcal {G},\infty } = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p \Vert (\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(i)}, \ldots , \varOmega _{j,k}^{(K)})\Vert _\infty \).
2.1 Background: CLIME for estimating sparse Gaussian graphical model
2.2 Background: multi-task learning with task-shared and task-specific parameters
2.3 SIMULE: infer Shared and Individual parts of MULtiple sGGM Explicitly
Treating sparse GGM estimation from each data block as a single task, our main goal is to learn multiple sGGMs over K tasks jointly, which can lead to a better generalization across all of the involved tasks (theoretically proven in Sect. 5).
2.4 Optimization
To solve Eq. (11), we follow the primal-dual interior-point method (Boyd and Vandenberghe 2004) that has also been used in the Dantzig selector for the task of regression (Candes and Tao 2007). Other strategies can be used to solve this linear program, such as the one used in Pang et al. (2014).
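As a rough sketch of the per-column linear program, the code below builds \(\theta \), \(\mathbf {A}^{(i)}\) and \(\varvec{b}\) as they are described in the list of symbols and minimizes \(\Vert \theta \Vert _1\) subject to \(\Vert \mathbf {A}^{(i)}\theta - \varvec{b}\Vert _{\infty } \le \lambda _n\) for every task. It is an illustration under assumptions: SciPy's generic `linprog` solver stands in for the primal-dual interior-point method, and the function name `simule_column` and the toy inputs are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def simule_column(sigmas, j, lam, eps):
    """Column j of every individual part and of the shared part via one LP.

    theta stacks beta^(1), ..., beta^(K) and eps*K*beta^S, so that
    ||theta||_1 = sum_i ||beta^(i)||_1 + eps*K*||beta^S||_1.
    """
    k, p = len(sigmas), sigmas[0].shape[0]
    e_j = np.zeros(p)
    e_j[j] = 1.0
    scale = eps * k
    # A^(i) = [0, ..., Sigma^(i), ..., 0, Sigma^(i) / (eps*K)], stacked over tasks.
    a = np.zeros((k * p, (k + 1) * p))
    for i, s in enumerate(sigmas):
        a[i * p:(i + 1) * p, i * p:(i + 1) * p] = s
        a[i * p:(i + 1) * p, k * p:] = s / scale
    b = np.tile(e_j, k)
    n = a.shape[1]
    # theta = u - v with u, v >= 0 turns the l1 objective into a linear one.
    res = linprog(np.ones(2 * n),
                  A_ub=np.vstack([np.hstack([a, -a]), np.hstack([-a, a])]),
                  b_ub=np.concatenate([lam + b, lam - b]),
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    theta = res.x[:n] - res.x[n:]
    beta_ind = theta[:k * p].reshape(k, p)  # row i: column j of Omega_I^(i)
    beta_shared = theta[k * p:] / scale     # column j of Omega_S
    return beta_ind, beta_shared


# Two tasks with identical covariance (true precision matrix: a 3-node chain).
sigma = np.linalg.inv(np.array([[2.0, -1.0, 0.0],
                                [-1.0, 2.0, -1.0],
                                [0.0, -1.0, 2.0]]))
beta_ind, beta_shared = simule_column([sigma, sigma], j=0, lam=0.05, eps=0.4)
```

In this toy case the two tasks are identical and \(\varepsilon < 1\) makes the shared part cheaper than the individual parts, so the whole column is explained by the shared part and the individual parts vanish.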
2.5 Parallel version of SIMULE
Algorithm 1 can easily be revised into a parallel version. Essentially, we only need to distribute the “for loop” of step 8 in Algorithm 1 as, for instance, “column per machine” or “column per core”. Since the calculation of each column is independent of the other columns, this parallel variation obtains the same solution as Algorithm 1 at a better speed. Section 6 shows the speed improvements of SIMULE in a multi-core setting versus a single-core setting.
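The column-level independence can be sketched with Python's standard `concurrent.futures` module. The per-column solver below is only a stand-in (it solves \(\varSigma \beta = \mathbf {e}_j\) exactly rather than running the linear program), and the function names are hypothetical; the point is that dispatching columns to workers returns the same matrix as the serial loop.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def solve_column(sigma, j):
    """Stand-in for the per-column LP: solves sigma @ beta = e_j exactly."""
    e_j = np.zeros(sigma.shape[0])
    e_j[j] = 1.0
    return np.linalg.solve(sigma, e_j)


def estimate_serial(sigma):
    p = sigma.shape[0]
    return np.column_stack([solve_column(sigma, j) for j in range(p)])


def estimate_parallel(sigma, workers=4):
    p = sigma.shape[0]
    # Columns do not depend on each other, so they can run on any worker;
    # map() returns the results in submission order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cols = list(pool.map(lambda j: solve_column(sigma, j), range(p)))
    return np.column_stack(cols)


rng = np.random.default_rng(0)
a = rng.standard_normal((50, 8))
sigma = a.T @ a / 50 + 0.1 * np.eye(8)  # toy positive definite covariance
same = np.allclose(estimate_serial(sigma), estimate_parallel(sigma))
```

Threads are used here only to keep the sketch short; for CPU-bound column solvers a process pool or a cluster scheduler would be the more natural choice.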
3 Method variation: nonparanormal SIMULE (NSIMULE)
Though sGGM is powerful, its normality assumption is commonly violated in real applications. For instance, for the TF ChIP-Seq data analyzed in Sect. 6.6, the histogram of one of its TF variables clearly does not follow a Gaussian distribution (across samples, shown as the right distribution graph in Fig. 2). After a univariate log-transformation of the same feature, we obtain its distribution histogram as the left graph in Fig. 2. The transformed data samples are approximately normally distributed. This motivates us to adopt a more generalized UGM (recently proposed in Liu et al. 2009) to overcome the limitation of sGGM. This so-called “nonparanormal graphical model” (Liu et al. 2009) assumes that data samples follow a multivariate nonparanormal distribution, which is a strict superset of the Gaussian distribution. We extend SIMULE to the nonparanormal family and name this novel variation NSIMULE. NSIMULE learns to fit multiple NGMs jointly through modeling task-shared and task-specific parameters explicitly.
3.1 Background: nonparanormal graphical model
3.2 Background: estimate \(\mathbf {S}\) through rank-based measures of correlation matrix \(\mathbf {S}_0\)
Since the direct estimation of the covariance matrix \(\mathbf {S}\) is difficult under the nonparanormal distribution, recent studies have proposed an efficient nonparametric estimator (Liu et al. 2009) for \(\mathbf {S}\). This estimator is derived from the correlation matrix \(\mathbf {S}_0\). Because the covariance matrix \(\mathbf {S}= diag(\mathbf {S}_i) \mathbf {S}_0 diag(\mathbf {S}_i)\), we have \(\mathbf {S}^{-1} = diag(\mathbf {S}_i)^{-1} \mathbf {S}_0^{-1} diag(\mathbf {S}_i)^{-1}\). Here \(\mathbf {S}_i = \sqrt{Cov(Z_i,Z_i)}\) and \(diag(\mathbf {S}_i) = diag(\mathbf {S}_1,\mathbf {S}_2,\ldots ,\mathbf {S}_p)\). Therefore, the inverse of the correlation matrix (\(\mathbf {S}_0^{-1}\)) and the inverse of the covariance matrix (\(\mathbf {S}^{-1}\)) have the same nonzero and zero entries. Based on this observation, Liu et al. (2009) proposed a nonparametric method to estimate the correlation matrix \(\mathbf {S}_0\), instead of estimating the covariance matrix \(\mathbf {S}\) for the purpose of structure inference.
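The claim that the inverse correlation matrix and the inverse covariance matrix share the same zero pattern can be checked numerically. The precision matrix below is a hypothetical example with unequal variances, so that the diagonal rescaling is not trivial.

```python
import numpy as np

# Hypothetical positive definite precision matrix with one missing edge (1, 3).
prec = np.array([[4.0, -2.0, 0.0],
                 [-2.0, 3.0, -0.5],
                 [0.0, -0.5, 1.0]])
cov = np.linalg.inv(prec)             # covariance matrix S
std = np.sqrt(np.diag(cov))           # S_i = sqrt(Cov(Z_i, Z_i))
corr = cov / np.outer(std, std)       # correlation matrix S_0

inv_cov = np.linalg.inv(cov)
inv_corr = np.linalg.inv(corr)

# S^{-1} = diag(S_i)^{-1} S_0^{-1} diag(S_i)^{-1}, written entry-wise.
identity_holds = np.allclose(inv_cov, inv_corr / np.outer(std, std))
# Rescaling by positive numbers cannot create or remove zeros.
same_support = np.array_equal(np.abs(inv_cov) > 1e-8, np.abs(inv_corr) > 1e-8)
```

Because every \(\mathbf {S}_i\) is strictly positive, each entry of \(\mathbf {S}^{-1}\) is a positive multiple of the corresponding entry of \(\mathbf {S}_0^{-1}\), which is why the graph can be read from either matrix.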
In Liu et al. (2009) the authors proposed using the population Kendall’s tau correlation coefficients \(\tau _{jk}\) to estimate \(\mathbf {S}_0\), based upon the explicit relationship between this rank-based measure \(\tau _{jk}\) and the correlation measure \((\mathbf {S}_{jk})_0\) for a given nonparanormal dataset \(Z \sim NPN_{p}(\mu , \mathbf {S},f_{1},\ldots ,f_{p})\) (discussed in Liu et al. 2012). Figure 2 presents the simple relationship between \(Z \sim NPN_{p}(\mu ,S;f_{1},\ldots ,f_{p})\) and its latent \(X \sim N(\mu ,S)\). To simplify notations, we use \(\mathbf {S}\) to represent the correlation matrix for the remainder of this paper.
Theorem 1
Proof
The proof is provided in Liu et al. (2009). \(\square \)
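A rank-based estimate of the correlation matrix can be sketched as follows. The sketch assumes the relationship in question is the usual sine transform \(\mathbf {S}_{jk} = \sin (\frac{\pi }{2}\tau _{jk})\) known for the nonparanormal family; that formula, the helper names and the simulated data are assumptions of this illustration, and the Kendall's tau implementation is a simple quadratic-time version that assumes no ties.

```python
import numpy as np


def kendall_tau_matrix(z):
    """Pairwise Kendall's tau (tau-a, no ties assumed) for the columns of z."""
    n, _ = z.shape
    upper = np.triu_indices(n, k=1)
    # Sign of every pairwise difference z[a] - z[b], a < b, for each column.
    signs = np.sign(z[:, None, :] - z[None, :, :])[upper]
    # (concordant - discordant) / number of pairs, for every column pair.
    return signs.T @ signs / signs.shape[0]


def rank_based_correlation(z):
    """Sine transform of Kendall's tau; the diagonal is set to exactly 1."""
    s = np.sin(np.pi / 2.0 * kendall_tau_matrix(z))
    np.fill_diagonal(s, 1.0)
    return s


rng = np.random.default_rng(0)
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=400)
# Strictly increasing marginal transforms produce non-Gaussian observed data.
z = np.column_stack([np.exp(x[:, 0]), x[:, 1] ** 3])
s_latent = rank_based_correlation(x)
s_observed = rank_based_correlation(z)
```

Since ranks are unchanged by strictly increasing transforms, the estimate computed from the transformed data matches the one computed from the latent Gaussian data, which is the property that lets the estimator ignore the unknown marginal functions.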
3.3 NSIMULE: nonparanormal SIMULE
We can now substitute each sample covariance matrix \(\hat{\varSigma }^{(i)}\) used in Eq. (9) from each task with its corresponding correlation matrix \(\mathbf {S}^{(i)}\) as estimated above. The rest of the computations are the same as SIMULE. We refer to this whole process as NSIMULE. It estimates multiple different, but related, sparse Nonparanormal Graphical Models (sNGM) through shared and task-specific parameter representations.
Theorem 2
If X, Y are two independent random variables and f,g \(:\mathbb {R}\rightarrow \mathbb {R}\) are two measurable functions, then f(X) and g(Y) are also independent.
Through the above theorem, the monotone functions f in \(NPN_{p}\) will not change the conditional dependency among variables. As proved in Liu et al. (2009), the conditional dependency network among the latent Gaussian variables X (in this \(NPN_{p}\)) is the same as the conditional dependency network among the nonparanormal variables \(Z_i\), with a parametric asymptotic convergence rate. Therefore, we can use the estimated correlation matrices \(\mathbf {S}^{(i)}\) for the joint network inference of multiple sNGMs in SIMULE. This is also because we have shown that the inverse of the correlation matrix and the inverse of the covariance matrix share the same nonzero and zero patterns.
4 Related work
4.1 Connecting to past multisGGM studies
Sparse GGM is an extremely active topic in the recent literature including notable studies like Wainwright and Jordan (2006) and Banerjee et al. (2008). We can categorize single-task sGGM estimators into three groups: (a) penalized likelihood (GLasso), (b) neighborhood approach and (c) CLIME estimator.
We choose the three most relevant studies as our baselines in the experiments: (a) Fused Joint graphical lasso (JGL-fused) (Danaher et al. 2013), (b) Group Joint graphical lasso (JGL-group) (Danaher et al. 2013) and (c) SIMONE (Chiquet et al. 2011). JGL-fused and JGL-group are based on the popular “graphical lasso” estimator (Friedman et al. 2008; Yuan and Lin 2007) [using \(L(\varOmega ) = -(\log \det (\varOmega ) - \langle \varSigma , \varOmega \rangle )\) in Eq. (13)]. SIMONE (Chiquet et al. 2011) follows a neighborhood-selection-based estimator. It can be viewed as using a pseudo-likelihood approximation instead of the full likelihood as \(L(\varOmega )\) in Eq. (13).^{2}
Table 1 A list of representative multi-sGGM methods and the second penalty functions they have used
No.  References  Penalty function \(P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)}) =\)

(1)  JGL-Fused (Danaher et al. 2013)  \(\sum \nolimits _{i,j,\, i>j}\Vert \varOmega ^{(i)} - \varOmega ^{(j)}\Vert _1\)
(2)  JGL-Group (Danaher et al. 2013)  \(\Vert \varOmega ^{(1)},\varOmega ^{(2)}, \ldots , \varOmega ^{(K)}\Vert _{\mathcal {G},2}\)
(3)  SIMONE (Chiquet et al. 2011)  \(\sum \nolimits _{i \ne j} \left( \left( \sum \nolimits _{k=1}^K\left( \varOmega _{ij}^{(k)}\right) _{+}^{2}\right) ^{\frac{1}{2}}+\left( \sum \nolimits _{k=1}^K\left( -\varOmega _{ij}^{(k)}\right) _{+}^{2}\right) ^{\frac{1}{2}}\right) \)
(4)  Node JGL (Mohan et al. 2013)  \(\sum \nolimits _{i,j,\, i>j}RCON(\varOmega ^{(i)} - \varOmega ^{(j)})\)
(5)  JEMGM (Guo et al. 2011)  \(\sum \nolimits _{k = 1}^K w_k\Vert \varOmega ^{(k)}\Vert _1\)
(6)  MTLGGM (Honorio and Samaras 2010)  \(\Vert \varOmega ^{(1)},\varOmega ^{(2)}, \ldots , \varOmega ^{(K)}\Vert _{\mathcal {G},\infty }\)
(7)  CSSLGGM (Hara and Washio 2013)  \(\Vert \varOmega _S\Vert _1 + \Vert \varOmega ^{(1)}_I, \varOmega ^{(2)}_I, \ldots , \varOmega ^{(K)}_I\Vert _{1,p}\)
In addition to these three baselines, a number of recent studies also perform multi-task learning of sGGM (Honorio and Samaras 2010; Guo et al. 2011; Zhang and Wang 2012; Zhang and Schneider 2010; Zhu et al. 2014). They all follow the same formulation as Eq. (13) but explore a different second penalty function \(P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\). As an example, Node-based JGL proposed a novel penalty, namely RCON (Mohan et al. 2013) or “row-column overlap norm” (shown as the 4th row of Table 1), for capturing special relationships among graphs. In two recent works, the penalty function at the 5th row of Table 1 has been used by Guo et al. (2011) and the penalty function at the 6th row of Table 1 has been used by Honorio and Samaras (2010).
Furthermore, there exist studies that explore similar motivations as ours when learning multiple GGM models from data. (a) Han et al. (2013) proposed to estimate a population graph from multi-block data using a so-called “median-graph” idea. It is conceptually similar to \(\varOmega _S\). However, they do not have \(\varOmega ^{(i)}_I\) to model individual parts that are specific to each task. (b) Another recent study, CSSLGGM (Hara and Washio 2013), also tried to model both the shared and individual substructures in multi-sGGMs. Different from ours, their formulation is within the penalized likelihood framework of Eq. (13). They used the \(\ell _{1,p}\) norm (see the last row of Table 1) to regularize the task-specific parts, while SIMULE uses the \(\ell _1\) norm instead in Eq. (6). The \(\ell _{1,p}\) norm pushes the individual parts of multiple graphs to be similar, which is contradictory to the original purpose of these parameters.^{3} (c) More recently, Monti et al. (2015) proposed to learn population and subject-specific brain connectivity networks via a so-called “Mixed Neighborhood Selection” (MNS) method. Following the neighborhood selection framework (Meinshausen and Bühlmann 2006), MNS learns the neighborhood of each node v. Similar to SIMULE, they estimated the neighborhood edges of a given node v in the ith task as \(\beta ^v + \widetilde{\varvec{b}}^{(i),v}\). Here \(\beta ^v\) represents the neighbors in the shared part and \(\widetilde{\varvec{b}}^{(i),v}\) represents the neighbors that are specific to the ith graph. Since MNS is specially designed for brain imaging data, it assumes each individual graph is generated by random effects, i.e., \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\). SIMULE does not have such strong assumptions on either task-specific or task-shared substructures, so our model is more general. (d) Another line of related studies (Liu et al. 2013; Sugiyama et al. 2013; Fazayeli and Banerjee 2016) proposed density-ratio based strategies to estimate a differential graph between two graphs. Even though this group of methods can handle unbalanced datasets (i.e., the numbers of samples in the two datasets are quite different), they can only capture the difference between two graphs (\(K=2\)). SIMULE does not have such a limitation on the number of tasks. (e) Moreover, several loosely related studies exist in settings different from ours. For example, for handling high-dimensional time series data a few recent papers have considered exploring multiple sGGMs by modeling relationships among networks; e.g., Kolar et al. (2010), Qiu et al. (2013).
4.2 Biasing covariance matrices with SIMONE-I and SIMULE-I
The SIMONE package (Chiquet et al. 2011) has introduced the “intertwined Lasso” (named “SIMONE-I” in the rest of this paper), which takes the perspective of sharing information among covariance matrices. More specifically, this variation averages each task’s sample covariance matrix with a global empirical covariance matrix obtained from the whole dataset. Motivated by SIMONE-I, we extend SIMULE with a similar strategy that revises each task’s sample covariance matrix with \(\tilde{ \varSigma }^{(i)} = \alpha \varSigma ^{(i)} + (1-\alpha ) n_{tot}^{-1}\sum _{t=1}^K {n_t\varSigma ^{(t)}} \). Here \(\alpha = 0.5\). This variation is referred to as “SIMULE-I” for the rest of this paper. We report experimental results from both SIMONE-I and SIMULE-I in Sect. 6.
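The revised covariance \(\tilde{\varSigma }^{(i)}\) is a convex combination of the task covariance and the sample-size-weighted pooled covariance, which can be sketched in a few lines. The function name `intertwined_covariances` and the toy inputs are hypothetical.

```python
import numpy as np


def intertwined_covariances(sigmas, n_samples, alpha=0.5):
    """Blend each task's covariance with the sample-size-weighted pooled one."""
    n = np.asarray(n_samples, dtype=float)
    pooled = sum(n_t * s for n_t, s in zip(n, sigmas)) / n.sum()
    return [alpha * s + (1.0 - alpha) * pooled for s in sigmas]


# Toy example: two tasks with 100 and 300 samples.
sigmas = [np.eye(2), 3.0 * np.eye(2)]
blended = intertwined_covariances(sigmas, n_samples=[100, 300])
```

In this toy case the pooled covariance is \(2.5\, I\), so the two blended matrices are \(1.75\, I\) and \(2.75\, I\): each task is pulled halfway toward the pooled estimate.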
Fan et al. (2014) have pointed out that the sample covariance matrix is inconsistent under a high-dimensional setting. There exists a huge body of previous literature on covariance matrix estimation. Relevant studies can be roughly grouped into three types: (a) Sparse covariance matrix estimation, including studies like hard-thresholding (Lam and Fan 2009), soft-thresholding (Tibshirani 1996), smoothly clipped absolute deviation (SCAD, Fan et al. 2013), and minimax concavity penalties (Zhang 2010). These methods, though simple, do not guarantee the positive definiteness of the estimated covariance matrix. (b) Positive definite sparse covariance matrix estimation that introduces penalty functions on the eigenvector space. Popular studies include Antoniadis and Fan (2011), Liu et al. (2014), Levina et al. (2008), Rothman (2012). (c) Factor-model based covariance matrix estimation like POET (Fan et al. 2013) that applies soft-thresholding to the residual space obtained after applying an approximate factor structure on the estimated covariance matrix. Most of these studies can be used to extend SIMULE. We leave a more thorough study of such combinations as future work.
4.3 Penalized log-likelihood for SIMULE
Comparing Eq. (9) of SIMULE with Eq. (13), previous multi-sGGM approaches mostly relied on penalized log-likelihood functions for learning. We have also considered extending a penalized log-likelihood method such as the “graphical lasso” estimator into our MTL setting: \(\varOmega ^{(i)} = \varOmega ^{(i)}_I + \varOmega _S\). However, it is hard to deal with the log-determinant term \(\log \det (\varOmega ^{(i)}_I + \varOmega _S)\) in optimization. This is because \(\frac{\partial \log \det (\mathbf {X}+ \mathbf {Y})}{\partial \mathbf {X}} = (\mathbf {X}+ \mathbf {Y})^{-1}\) and it is difficult to calculate the inverse of \(\mathbf {X}+\mathbf {Y}\). There is no closed form for the inverse of the sum of two matrices, except for certain special cases. From the perspective of optimization, it is hard to directly use methods like coordinate descent for learning such a model due to this first derivative issue. The authors of CSSLGGM (Hara and Washio 2013) handled the issue by using the Alternating Direction Method of Multipliers (ADMM). This optimization is much more complicated than SIMULE. Another package, MNS (Monti et al. 2015), tackled this issue partly through adding latent variables. It assumes that each individual part \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\), in which \(\varPhi ^v\) is a latent variable learned by the EM algorithm.
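The gradient identity for the log-determinant term can be verified by finite differences. The two symmetric matrices below are hypothetical, and the check only illustrates that every gradient evaluation requires inverting the sum \(\mathbf {X}+\mathbf {Y}\), which is what couples the shared and individual parts.

```python
import numpy as np


def logdet(m):
    """log(det(m)) computed stably; m must have a positive determinant."""
    sign, value = np.linalg.slogdet(m)
    if sign <= 0:
        raise ValueError("matrix must have a positive determinant")
    return value


x = np.array([[2.0, 0.3], [0.3, 1.0]])
y = np.array([[1.0, -0.2], [-0.2, 0.5]])
analytic = np.linalg.inv(x + y)  # (X + Y)^{-1}; X + Y is symmetric here

# Central finite differences, perturbing one entry of X at a time.
h = 1e-6
numeric = np.zeros_like(x)
for i in range(2):
    for j in range(2):
        step = np.zeros_like(x)
        step[i, j] = h
        numeric[i, j] = (logdet(x + step + y) - logdet(x - step + y)) / (2 * h)
```

The numerical and analytic gradients agree up to discretization error, so an iterative method on this objective would need a fresh matrix inverse at every update of either part.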
4.4 Relevant studies using \(\ell _1\) optimization with constraints
Table 2 Two categories of relevant studies: learning with “penalized log-likelihood” or learning with “\(\ell _1\) constrained optimization”
Tasks  Penalized likelihood  \(\ell _1\) constrained optimization

High-dimensional linear regression  Lasso: \({\mathop {\hbox {argmin}}\nolimits _{\beta }}\Vert \mathbf {Y} - \beta \mathbf {X}\Vert _F + \lambda \Vert \beta \Vert _1\)  Dantzig selector: \({\mathop {\hbox {argmin}}\nolimits _{\beta }} \Vert \beta \Vert _1\) subject to: \(\Vert \mathbf {X}^T\mathbf {y} - \mathbf {X}^T\mathbf {X}\beta \Vert _{\infty }\le \lambda \)
sGGM  GLasso: \({\mathop {\hbox {argmin}}\nolimits _{\varOmega \ge 0}} -\log \det (\varOmega ) + \langle \varOmega ,\varSigma \rangle + \lambda \Vert \varOmega \Vert _1\)  CLIME: \({\mathop {\hbox {argmin}}\nolimits _{\varOmega \ge 0}} \Vert \varOmega \Vert _1\) subject to: \(\Vert \varOmega \varSigma - I\Vert _{\infty } \le \lambda \)
Multi-task learning of sGGM  Different penalty: \({\mathop {\hbox {argmin}}\nolimits _{\varOmega ^{(i)} > 0}} \sum \limits _i L(\varOmega ^{(i)}) + \lambda _1 \sum \limits _i \Vert \varOmega ^{(i)}\Vert _1 + \lambda _2 P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\)  Our SIMULE: \({\mathop {\hbox {argmin}}\nolimits _{\varOmega ^{(i)}_I,\varOmega _S}}\sum \limits _i \Vert \varOmega ^{(i)}_I\Vert _1+ \varepsilon K\Vert \varOmega _S\Vert _1\) subject to: \(\Vert \varSigma ^{(i)}(\varOmega ^{(i)}_I + \varOmega _S) - I\Vert _{\infty } \le \lambda _{n}, \; i = 1,\ldots ,K\)
4.5 Optimization and computational concerns
Furthermore, a number of papers have focused on improving the computational performance and data storage of sGGM estimation. For example, the BigQUIC algorithm (Hsieh et al. 2011) aims at an asymptotic quadratic optimization when estimating sGGM. The authors of the Node-based JGL method (Mohan et al. 2013) proposed a block-separate method to improve the computational performance of multi-sGGMs. Our method can be extended to a parallel setting, since it can be naturally decomposed into per-column optimization problems. Node-based JGL (Mohan et al. 2013) has also used the ADMM optimization algorithm for learning. Compared to the linear-programming-based SIMULE, the method of Mohan et al. (2013) has two main disadvantages: (a) a large number of complex optimizations must be solved, and (b) the condition of convergence is unspecified. SIMULE makes use of an efficient algorithm, is proven to achieve good numerical performance, and ensures convergence (see Sect. 5).
In addition, since imposing an \(\ell _1\) penalty on the model parameters formulates the structure learning of UGMs as a convex optimization problem, this strategy has shown successful results for modeling continuous data with GGM or NGM and discrete data with pairwise Markov Random Fields (MRFs) (Friedman et al. 2008; Höfling and Tibshirani 2009). However, the discrete case of pairwise MRFs is much harder because of the potentially intractable normalizing constant and the possibility that each edge may have multiple parameters. One more complicating factor in structure learning is that the pairwise assumption might need a few exceptions, such as the search for higher-order combinatorial interactions in recent studies like Schmidt and Murphy (2010) and Buchman et al. (2012). (Detailed descriptions of related work are not included due to space limitations.)
5 Theoretical Analysis
5.1 Theoretical Analysis for Basic SIMULE
Lemma 1
Proof
If \(ab<0\), let \(c = a + b\) and \(d = 0\). Then \(cd = 0 \ge 0\) and \(c+d = a+b\). Therefore, \(|c| + \varepsilon |d| = |a+b|< |a| < |a| + \varepsilon |b|\). This contradicts the assumption that a and b are the optimal solution of Eq. (14). \(\square \)
Corollary 1
Assume \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7), then
\(\Vert \varOmega _S + \varOmega _I^{(i)}\Vert _1 = \Vert \varOmega _S\Vert _1 + \Vert \varOmega _I^{(i)}\Vert _1\)
Proof
By Lemma 1, we have that \(\widehat{\varOmega }_{I,j,k}^{(i)}\widehat{\varOmega }_{S,j,k} \ge 0\), if \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7). \(\square \)
We use \(\varSigma _{tot}^0\) to represent the true value of \(\varSigma _{tot}\) and \(\widehat{\varSigma }_{tot}\) to represent its estimate. We also use \(\varOmega ^0_{tot} = (\omega _1^0,\omega _2^0,\ldots ,\omega _{Kp}^0)\) to describe the true \(\varOmega _{tot}\) and \(\widehat{\varOmega }^1_{tot} = (\hat{\omega }^1_{ij})\) to denote the solution of the optimization problem in Eq. (15). The final solution is denoted as \(\widehat{\varOmega }_{tot} := (\hat{\omega }_{ij}) = (\hat{\omega }_1,\hat{\omega }_2,\ldots ,\hat{\omega }_{Kp})\) where \(\hat{\omega }_{ij} = \hat{\omega }_{ji} = \hat{\omega }^{1}_{ij}\mathop {\mathrm {sign}}( \max (|\hat{\omega }^{1}_{ij}| - |\hat{\omega }^{1}_{ji}|,0)) + \hat{\omega }^{1}_{ji} \mathop {\mathrm {sign}}( \max (|\hat{\omega }^{1}_{ji}| - |\hat{\omega }^{1}_{ij}|, 0))\). Furthermore, we denote \({\mathbb E}[X^{(i)}]\) as \((\mu _1^{(i)},\mu _2^{(i)}, \ldots ,\mu _p^{(i)})^T\).
Lemma 2
Lemma 3
If each \(\varOmega _I^{(i)} + \varOmega _S\) satisfies Condition Eq. (19), then \(\varOmega _{tot}\) also satisfies Condition Eq. (19).
Proof
Use the definition of \(\varOmega _{tot}\). \(\square \)
Corollary 2
\(\widehat{\varOmega }_{tot}\) satisfies the condition \(\widehat{\varOmega }_{tot} \succ 0\), with a high probability.
Theorem 3
Theorem 4
Suppose that \(\varOmega ^0_{tot} \in \mathcal {U}(q,s_0(p))\) and \(n_{tot}=\sum _{i=1}^K n_i\).
Through Eqs. (23)–(29), we theoretically prove that we can achieve a good estimation of the target dependency graphs with the convergence rate \(O(\log (Kp)/n_{tot})\). Based on CLIME (Cai et al. 2011), the convergence rate of single-task sGGM is \(O(\log p/n_i)\), where \(n_{i}\) is the number of samples of the ith task. Assuming \(n_i = \frac{n_{tot}}{K}\), the convergence rate of single-task sGGM is \(O(K\log p/n_{tot})\). Since \(K\log p > \log (Kp)\) whenever \(p^{K-1} > K\) (which holds for any \(K \ge 2\) and \(p \ge 3\)), the convergence rate of SIMULE is better than that of single-task sGGM. This provides theoretical support for the benefit of multi-tasking sGGM. Neither of these theoretical results has been investigated by previous studies.
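As a quick numerical illustration of this comparison, the short Python sketch below evaluates the two rate expressions, \(K\log p/n_{tot}\) and \(\log (Kp)/n_{tot}\), for a few values of K. The function names and the chosen values of p and \(n_{tot}\) are ours and purely illustrative; constants hidden by the \(O(\cdot )\) notation are ignored.

```python
import math


def single_task_rate(p, n_tot, K):
    """Rate expression K * log(p) / n_tot (single-task sGGM with n_i = n_tot / K)."""
    return K * math.log(p) / n_tot


def simule_rate(p, n_tot, K):
    """Rate expression log(K * p) / n_tot (joint estimation)."""
    return math.log(K * p) / n_tot


# Illustrative setting: p = 100 variables and n_tot = 600 samples in total.
for K in range(2, 7):
    ratio = single_task_rate(100, 600, K) / simule_rate(100, 600, K)
    print(K, round(ratio, 2))
```

The ratio grows with K, which reflects that the gap between the two expressions widens as more contexts share the same total sample budget.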
All proofs of above theorems are provided in Sect. 7.
5.2 Theoretical analysis for NSIMULE estimator
In this subsection, we investigate the theoretical properties of the NSIMULE estimator and prove that its convergence rate is the same as that of SIMULE.
Theorem 5
Based on Liu et al. (2009), suppose we use the estimated Kendall’s tau correlation matrix \(\widehat{\mathbf {S}}\) to replace \(\widehat{\varSigma }\) in the parametric GLasso estimator. Then, under the same conditions on \(\widehat{\varSigma }\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal estimator achieves the same (parametric) rate of convergence as the GLasso estimator for both precision matrix estimation and graph structure recovery.
Theorem 6
We use the estimated Kendall’s tau correlation matrix \(\hat{\mathbf {S}}^{(i)}\) in the Nonparanormal SIMULE estimator. Then under the same conditions on \(\hat{\varSigma }^{(i)}\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal SIMULE achieves the same rate of convergence as SIMULE (\(O(\sqrt{\log (Kp)/n_{tot}})\)) for both the graph recovery and precision matrix estimation.
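For readers unfamiliar with the rank-based estimator, the following Python sketch computes a Kendall’s tau correlation matrix from data columns. It is a minimal illustration under two assumptions that this section does not spell out: the entries are obtained with the transform \(\sin (\frac{\pi }{2}\hat{\tau }_{jk})\) that is standard for rank-based nonparanormal estimation, and the data contain no ties. The function names are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau (tau-a) between two equal-length samples, assuming no ties."""
    n = len(x)
    score = 0
    for a, b in combinations(range(n), 2):
        prod = (x[a] - x[b]) * (y[a] - y[b])
        score += (prod > 0) - (prod < 0)  # +1 for concordant, -1 for discordant pairs
    return 2.0 * score / (n * (n - 1))


def kendall_correlation_matrix(columns):
    """Matrix with entries sin(pi/2 * tau_jk); `columns` is a list of variables."""
    p = len(columns)
    S = [[1.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1, p):
            tau = kendall_tau(columns[j], columns[k])
            S[j][k] = S[k][j] = math.sin(math.pi / 2.0 * tau)
    return S


x = [0.3, -1.2, 2.5, 0.9, -0.4]
y = [v ** 3 for v in x]          # a monotone transform of x
z = [1.0, 0.2, -0.7, 1.5, 0.1]
S = kendall_correlation_matrix([x, y, z])
print(S[0][1], S[0][2] == S[1][2])
```

Because the estimator depends only on the ordering of the observations, replacing a column by a strictly increasing transform of itself leaves the matrix unchanged, which is the property the nonparanormal extension relies on.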
5.3 Potential nonidentifiability issue
A linear program is not strongly convex. Therefore, there may be multiple optimal solutions in the SIMULE formulation of Eq. (7) (i.e., an identifiability problem). In fact, the CLIME (Cai et al. 2011) estimator may also have multiple optimal solutions. Cai et al. (2011) have proved that all such solutions converge to the true one at an optimal convergence rate. Similarly, the SIMULE formulation in Eq. (7) may result in multiple optimal solutions \(\{ \hat{\varOmega }_{tot} \}\); in Sect. 5.1, we have proved that each of these solutions \(\hat{\varOmega }_{tot}\) converges to the true solution at an optimal convergence rate. We present Theorem 7 showing that for each optimal solution \(\hat{\varOmega }_{tot}\), when \(\varepsilon \ne 1\), we can obtain a unique estimation of \(\varOmega _S\) and \(\{ \varOmega _I^{(i)} \mid i = 1,\ldots ,K\}\).
Theorem 7
When we pick \(\varepsilon > 0\) and \(\varepsilon \ne 1\), for each optimal solution \(\hat{\varOmega }_{tot}\) from Eq. (7), there exist unique \(\varOmega _S\) and \(\{ \varOmega _I^{(i)} \mid i = 1,\ldots ,K\}\) satisfying Eq. (5).
We provide the proof of Theorem 7 in Sect. 7.
In practice, we need to choose \(\varepsilon \) according to the application for which SIMULE is used. For example, for genome-related biomedical data, we can normally assume that the shared subgraph is denser than the individual interactions of each context.^{4} Therefore, we pick \(\varepsilon < 1\) to reflect this assumption in our experiments in Sect. 6.
6 Experiments
AUC and partial AUC on simulated Gaussian datasets from Model 1
Method  AUC  AUC-individual  AUC-shared  AUC-FPR \(\le \) 20%  AUC-FPR \(\le \) 5%  AUC (\(p = 200\))  

Gaussianmodel1[\(K=3\)][\(p=100\)]  
NSIMULE  0.9872  0.8408  0.8964  0.1599  0.0188  0.9959 
SIMULE  0.9844  0.8379  0.8788  0.1587  0.0179  0.9945 
JGLfused  0.6843  0.5666  0.9817  0.0989  0.0094  0.6745 
JGLgroup  0.5162  0.4988  0.5759  0.0908  0.0174  0.5122 
SIMONE  0.7748  0.5124  0.9321  0.0992  0.0171  0.5488 
CLIME  0.6509  0.5197  0.7795  0.0439  0.0001  0.5422 
NCLIME  0.5400  0.4999  0.8224  0.0434  0.0001  0.5216 
SIMONEI  0.8041  0.6740  0.9681  0.1030  0.0177  0.6122 
SIMULEI  0.9979  0.8594  0.9249  0.1604  0.0183  0.9984 
AUC and partial AUC on simulated nonparanormal datasets from Model 1
Method  AUC  AUC-individual  AUC-shared  AUC-FPR \(\le \) 20%  AUC-FPR \(\le \) 5%  AUC (\(p = 200\))  

Nonparanormalmodel1[\(K = 3\)][\(p=100\)]  
NSIMULE  0.8172  0.8408  0.8964  0.1599  0.0188  0.8325 
SIMULE  0.7322  0.8165  0.8788  0.1567  0.0173  0.7745 
JGLfused  0.6942  0.6362  0.9817  0.0993  0.0124  0.7308 
JGLfusednonparanormal  0.6978  0.6374  0.9896  0.1012  0.0137  0.7343 
JGLgroup  0.5181  0.4942  0.8050  0.0487  0.0064  0.5416 
JGLgroupnonparanormal  0.6500  0.5563  0.8614  0.0868  0.0113  0.5498 
SIMONE  0.7198  0.5061  0.9321  0.1080  0.0102  0.5671 
SIMONEnonparanormal  0.7271  0.5072  0.9327  0.1146  0.0135  0.5766 
CLIME  0.5803  0.5108  0.7745  0.0427  0.0001  0.5127 
NCLIME  0.5298  0.4935  0.8224  0.0398  0.0001  0.4915 
6.1 Experimental settings
6.1.1 Baselines
We compare SIMULE, SIMULEI and NSIMULE with the following baselines: (1) three different multi-sGGM estimators, namely JGLfused, JGLgroup (Danaher et al. 2013), and SIMONE (Chiquet et al. 2011) (with the penalty functions described in Table 1); (2) the single-task CLIME baseline (i.e., each task uses CLIME independently); (3) the single-task nonparanormal CLIME (NCLIME) baseline (i.e., each task uses NCLIME independently); (4) the nonparanormal extensions of JGLfused, JGLgroup, and SIMONE;^{5} and (5) SIMONEI (“intertwined Lasso” in the SIMONE package), which is added as a baseline for comparison with SIMULEI.^{6}
6.1.2 Metric
AUC and partial AUC on simulated Gaussian datasets from Model 2
Method  AUC  AUC-individual  AUC-shared  AUC-FPR \(\le \) 20%  AUC-FPR \(\le \) 5%  AUC (\(p = 200\))  

Gaussianmodel2[\(K=2\)]  
NSIMULE  0.9997  0.8095  0.9727  0.1997  0.0497  1.0000 
SIMULE  0.9996  0.8391  0.9697  0.1997  0.0497  0.9998 
JGLfused  0.9991  0.4893  0.9983  0.1991  0.0491  0.9993 
JGLgroup  0.9999  0.5000  0.7715  0.1999  0.0499  0.9866 
SIMONE  0.9989  0.7632  0.9982  0.1989  0.0489  0.9990 
CLIME  0.5077  0.3948  0.7517  0.0404  0.0025  0.5037 
NCLIME  0.4995  0.4043  0.7614  0.0402  0.0025  0.5022 
SIMONEI  0.9995  0.7692  0.9986  0.1995  0.0499  0.9951 
SIMULEI  0.9997  0.8704  0.9997  0.1999  0.0499  1.0000 
AUC and partial AUC on simulated nonparanormal datasets from Model 2
Method  AUC  AUC-individual  AUC-shared  AUC-FPR \(\le \) 20%  AUC-FPR \(\le \) 5%  AUC (\(p = 200\))  

Nonparanormalmodel2[\(K=2\)]  
NSIMULE  0.9993  0.8095  0.9727  0.1993  0.0493  1.0000 
SIMULE  0.9993  0.8453  0.9609  0.1929  0.0419  0.9996 
JGLfused  0.9984  0.5117  0.9984  0.1984  0.0424  0.9996 
JGLfusednonparanormal  0.9990  0.5641  0.9998  0.1986  0.0425  0.9996 
JGLgroup  0.9784  0.6791  0.9151  0.1967  0.0464  1.0000 
JGLgroupnonparanormal  0.9899  0.6948  0.9391  0.1969  0.0465  1.0000 
SIMONE  0.9991  0.7529  0.9911  0.1990  0.0491  0.9983 
SIMONEnonparanormal  0.9992  0.7960  0.9948  0.1993  0.0492  0.9986 
CLIME  0.4985  0.3740  0.7517  0.0238  0.0001  0.5041 
NCLIME  0.4994  0.4042  0.7614  0.0203  0.0001  0.5022 
6.1.3 Selection of hyperparameter \(\lambda _{n}\)
Recent studies by Negahban et al. (2009) and Yang et al. (2014) conclude that the regularization parameter \(\lambda \) of a single-task sGGM (e.g., with \(n_i\) samples) should satisfy \(\lambda \propto \sqrt{\frac{\log p}{n_i}}\). Combining this conclusion with our theoretical analysis in Sect. 5, we choose \(\lambda _{n} = \alpha \sqrt{\frac{\log (Kp)}{n_{tot}}}\), where \(\alpha \) is a hyperparameter to tune for SIMULE or NSIMULE. In our experiments, \(\alpha \) is varied over the range \( \{0.05 \times i \mid i\in \{1,2,3,\ldots ,30 \}\}\). The Bayesian information criterion (BIC) is used whenever a specific hyperparameter value must be selected.
In addition, we need to tune the hyperparameters of the baseline methods to obtain their FPR–TPR curves. If only one hyperparameter needs tuning, we follow the same strategy as for SIMULE. For those baselines (JGLfused and JGLgroup) that have two hyperparameters, given a certain \(\lambda _1\) (the same as \(\lambda _n\)), we use BIC to select the best \(\lambda _2\) from the range \(\{0.05\times i \mid i\in \{1,2,3,\ldots ,20 \}\}\).
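The grid of \(\lambda _n\) values described in this subsection can be written down directly. The sketch below is a minimal Python illustration; the function name and the example values of K, p and \(n_{tot}\) are ours, not taken from the experiments.

```python
import math


def lambda_grid(K, p, n_tot, step=0.05, count=30):
    """Candidate values lambda_n = alpha * sqrt(log(K * p) / n_tot),
    with alpha in {step * i : i = 1, ..., count}."""
    base = math.sqrt(math.log(K * p) / n_tot)
    return [step * i * base for i in range(1, count + 1)]


# Illustrative setting: K = 3 tasks, p = 100 variables, 600 samples in total.
grid = lambda_grid(K=3, p=100, n_tot=600)
print(len(grid), round(grid[0], 4), round(grid[-1], 4))
```

Sweeping over this grid yields one estimated graph per value of \(\alpha \), which is what produces the points of an FPR–TPR curve.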
6.1.4 Selection of hyperparameter \(\varepsilon \)
\(\varepsilon \) reflects the difference in sparsity between the shared subgraph and the context-specific subgraphs. Section 5.3 has discussed our choice of \(\varepsilon \) on two real-world datasets. Similarly, for the simulated experiments, we select \(\varepsilon \) from the range \(\{0.1\times i \mid i\in \{1,2,\ldots ,9\}\}\).
6.2 Simulated Gaussian datasets

Model 1 Following Rothman et al. (2008), this model assumes \(\varOmega ^{(i)} = \mathbf {B}^{(i)}_I + \mathbf {B}_S + \delta ^{(i)}I\), where each off-diagonal entry in \(\mathbf {B}^{(i)}_I\) is generated independently and equals 0.5 with probability 0.05i and 0 with probability \(1 - 0.05i\). The entries of the shared part \(\mathbf {B}_S\) are generated independently and equal 0.5 with probability 0.1 and 0 with probability 0.9. \(\delta ^{(i)}\) is selected large enough to guarantee the positive definiteness of the precision matrix. A clear shared structure \(\mathbf {B}_S\) exists among multiple graphs. We choose \(K \in \{2,3,4,5,6\}\) for this case.

Model 2 This model uses two special-structure graphs, i.e., a grid graph and a ring graph. The first task uses a grid graph and the second uses a ring graph. These two special networks are popular in many real-world applications. For instance, certain biological pathways can be represented as rings. A clear shared structure exists between these two graphs. By construction, \(K = 2\) for this case.
In summary, we first simulate precision matrices by Model 1 or Model 2. We then use the multivariate distribution method (Ripley 2009) to sample multivariate Gaussian distributed data blocks with mean 0 and covariance matrix \((\varOmega ^{(i)})^{-1}\). This stochastic procedure generates simulated data blocks with the decomposition in Eq. (5). We then apply SIMULE, NSIMULE and the baseline models to these datasets to obtain the estimated dependency networks.
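The construction of Model 1 can be sketched in a few lines of Python. This is only an illustration under assumptions of ours: the sparsity patterns are drawn on the upper triangle and mirrored so that each matrix is symmetric, and \(\delta ^{(i)}\) is chosen by a diagonal-dominance argument (the text only requires it to be “large enough”). The function names are hypothetical, and the subsequent Gaussian sampling step is omitted.

```python
import random


def random_symmetric_pattern(p, prob, value, rng):
    """Symmetric p x p matrix with zero diagonal; each off-diagonal pair
    equals `value` with probability `prob` and 0 otherwise."""
    B = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1, p):
            if rng.random() < prob:
                B[j][k] = B[k][j] = value
    return B


def simulate_model1(p, K, seed=0):
    """Return the shared part B_S and K precision matrices B_I^(i) + B_S + delta^(i) I."""
    rng = random.Random(seed)
    B_S = random_symmetric_pattern(p, 0.1, 0.5, rng)
    omegas = []
    for i in range(1, K + 1):
        B_I = random_symmetric_pattern(p, 0.05 * i, 0.5, rng)
        omega = [[B_I[j][k] + B_S[j][k] for k in range(p)] for j in range(p)]
        # Assumption: delta exceeds the largest off-diagonal row sum, so the matrix
        # is strictly diagonally dominant and therefore positive definite.
        delta = max(sum(abs(v) for v in row) for row in omega) + 0.1
        for j in range(p):
            omega[j][j] = delta
        omegas.append(omega)
    return B_S, omegas


B_S, omegas = simulate_model1(p=20, K=3, seed=1)
print(len(omegas), len(omegas[0]))
```

Every edge of the shared pattern appears in all K precision matrices, while the individual patterns become denser as the task index i grows.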
6.3 Simulated nonparanormal datasets
Using the same graphs of Model 1 and Model 2, we simulate two more sets of data samples following nonparanormal distributions in K different tasks. We pick \(K=3\) for Model 1 and \(K =2\) for Model 2. Starting from \(N(0, (\varPhi ^{(i)})^{-1})\) and transforming with the monotone function \(x \mapsto \mathrm {sign}(x)|x|^{\frac{1}{2}}\), each data sample is generated as a random vector Z, where \(\mathrm {sign}(Z)Z^2 = (\mathrm {sign}(\mathbf {z}_1)\mathbf {z}_1^2,\ldots ,\mathrm {sign}(\mathbf {z}_p)\mathbf {z}_p^2) \sim N(0, (\varPhi ^{(i)})^{-1})\). Thus Z follows a nonparanormal distribution. Using different monotone functions for generating simulated data will not change the resulting correlation matrix \(\mathbf {S}\), since we estimate the correlation matrix with a rank-based nonparametric estimator.
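The marginal transform used here is easy to state in code. The Python sketch below applies the signed square root to Gaussian draws and checks that the signed square recovers them; the function names are ours, and the sketch only covers the coordinate-wise transform, not the sampling of correlated Gaussian vectors.

```python
import math
import random


def np_transform(x):
    """Monotone map x -> sign(x) * |x|**0.5 applied to one Gaussian coordinate."""
    return math.copysign(math.sqrt(abs(x)), x)


def inverse_transform(z):
    """Inverse map z -> sign(z) * z**2, which recovers the Gaussian coordinate."""
    return math.copysign(z * z, z)


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(5)]
transformed = [np_transform(x) for x in gaussian]
recovered = [inverse_transform(z) for z in transformed]
print(all(math.isclose(a, b) for a, b in zip(gaussian, recovered)))
```

Since the map is strictly increasing, it preserves the ordering of the observations in each coordinate, which is why a rank-based correlation estimate is unaffected by it.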
6.4 Experimental results on synthetic datasets
Figure 3 shows more detailed FPR–TPR comparisons among the joint sGGM estimators. Subfigure (a) “GaussianModel1” clearly shows that our methods obtain better curves than the three multi-sGGM baselines. In subfigure (b) “GaussianModel2”, the differences among the multi-sGGM estimators are not as apparent as in “GaussianModel1”. Figure 3c, d (in the second row) show FPR–TPR curves from the two nonparanormal datasets. Overall, the observations from comparing these curves are consistent with those obtained from the four tables (Tables 3, 4, 5, 6).
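For reference, the area under an FPR–TPR curve, restricted to a maximum false positive rate, can be computed with the trapezoidal rule. The Python sketch below is an assumption about how such numbers may be obtained, not the authors' evaluation code: the function name is hypothetical, and the area is left unnormalized, which appears consistent with the partial-AUC columns of the tables topping out near 0.2 and 0.05.

```python
def partial_auc(fpr, tpr, max_fpr=1.0):
    """Trapezoidal area under ROC points for FPR in [0, max_fpr].
    `fpr` must be sorted in non-decreasing order; the area is not normalized."""
    area = 0.0
    points = list(zip(fpr, tpr))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the FPR cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A perfect ranking and a chance-level ranking as sanity checks.
perfect_fpr, perfect_tpr = [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]
chance_fpr, chance_tpr = [0.0, 1.0], [0.0, 1.0]
print(partial_auc(perfect_fpr, perfect_tpr), partial_auc(perfect_fpr, perfect_tpr, 0.2))
print(partial_auc(chance_fpr, chance_tpr), partial_auc(chance_fpr, chance_tpr, 0.2))
```

Under this convention, a perfect method reaches a partial area equal to the FPR cut-off itself, so values close to 0.2 or 0.05 indicate near-perfect recovery in the low-FPR region.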
One important hyperparameter we need to pick for SIMULE and NSIMULE is \(\varepsilon \) in Eq. (7). This hyperparameter reflects the sparsity level of the shared subgraph relative to the context-specific parts. The two left subfigures (a) and (c) of Fig. 5 show the changes of sparsity level in both the individual and the shared subgraphs across multiple values of \(\varepsilon \) (obtained by running SIMULE on the GaussianModel 1 case). Figure 5a is for \(K=3\) and Fig. 5c is for \(K= 6\). We can see that when \(\varepsilon \) increases, the sparsity level of the shared portion decreases while the average sparsity of the individual parts increases for both values of K. This matches our analysis in Sect. 5.3. In real applications, \(\varepsilon \) indicates the difference between the sparsity constraints we assume on the shared and individual parts. It should be chosen according to the domain knowledge of a specific application.
In addition, Fig. 5e investigates whether changes of \(\varepsilon \) influence the performance (AUC) of (N)SIMULE. On datasets from Model 1, it shows that the AUC scores of both SIMULE and NSIMULE exhibit very small variations across a large range of \(\varepsilon \) values.
Section 2.5 presents a parallel variation of SIMULE. On the simulated data of GaussianModel 1, we compare the training speed of originalSIMULE versus parallelSIMULE in Fig. 5b (the upper right subfigure). For parallelSIMULE, we run SIMULE by parallelizing “column per core” using 63 cores on a 64-core machine. The baselines, including originalSIMULE, JGLfused, JGLgroup, SIMONE and SIMONEI, are run on one restricted core on the same machine.^{8} Figure 5b reports the computational speed (training time in \(\log \)-seconds) across values of the dimension p. It clearly shows that parallelSIMULE runs much faster than the single-core SIMULE implementation and the other baselines.^{9}
Furthermore, Fig. 5d, f provide AUC scores of SIMULE, the three multi-sGGM baselines, SIMONEI and SIMULEI against varying p and against varying K. SIMULEI provides a consistent performance improvement over SIMULE. SIMULE outperforms the three multi-sGGM baselines across multiple values of p and K. SIMULEI outperforms SIMONEI and runs smoothly for larger values of K. Unfortunately, Fig. 5f cannot provide the AUC scores of SIMONEI for \(K=5\) and \(K=6\), because the SIMONE package could not converge for some \(\lambda \) values under these two cases of K.
6.5 Experiment results on real application I: identifying gene interaction using gene expression data across two cell contexts
Next, we apply SIMULE and the baselines to one real-world biomedical dataset: gene expression profiles describing many human samples across multiple cancer types (aggregated by McCall et al. 2011). Recent advancements in genome-wide monitoring have resulted in enormous amounts of data across most of the common cell contexts, such as multiple common cancer types (The Cancer Genome Atlas Research Network 2011). Complex diseases such as cancer are the result of multiple genetic and epigenetic factors. Thus, recent research has shifted towards the identification of multiple genes/proteins that interact directly or indirectly in contributing to certain disease(s). Structure learning of UGMs on such heterogeneous datasets can uncover statistical dependencies among genes and help us understand how such dependencies vary from normal to abnormal or across different diseases. These structural variations are highly likely to be contributing markers that influence or cause the diseases.
6.6 Experiment results on real application II : identifying collaborations among TFs across multiple cell types
In molecular biology, the regulatory proteins that interact with one another to control gene transcription are known as transcription factors (TFs). TF proteins typically perform major cell regulatory functions (e.g., binding on DNA) by working together with other TFs. The collaboration patterns (e.g., conditional independence) among TFs normally vary across different cell contexts (e.g., cell lines). Meanwhile, a certain portion of the TF interactions are preserved across contexts. Understanding the collaboration networks among TFs is the key to understanding cell development, including defects, which lead to different diseases. The ChIPSeq datasets recently made available by the ENCODE project (ENCODE Project Consortium 2011) provide simultaneous binding measurements of TFs to thousands of gene targets. These measurements provide a “snapshot” of TF binding events across many cell contexts. The task of uncovering the functional dependencies among TFs connects to the task of discovering the statistical dependencies among TFs from their ChIPSeq measurements.
Recently, two relevant papers (Cheng et al. 2011; Min et al. 2014) have discussed methods to infer co-association networks among TFs using ChIPSeq data. Their approaches differ from ours in that both projects targeted a single cell type at a time. We select ChIPSeq data for 27 TFs that are covered by ENCODE (ENCODE Project Consortium 2012) across three major human cell lines: (1) H1hESC (embryonic stem cells: primary tissue), (2) GM12878 (B-lymphocyte: normal tissue) and (3) K562 (leukemia: cancerous tissue).
We apply SIMULE, SIMULEI, JGLgroup, JGLfused, SIMONE and SIMONEI to this multi-cell TF ChIPSeq dataset. Comparisons of the different methods are performed using three major existing protein interaction databases (Prasad et al. 2009; Orchard et al. 2013; Stark et al. 2006). The numbers of matches between the TF-TF interactions in the databases and those predicted by each method are plotted as a bar graph in Fig. 6b (the right subfigure). The graph shows that SIMULE consistently outperforms JGLfused, JGLgroup and SIMONE on both individual and shared interactions from all three cell types. SIMULEI performs better than SIMONEI and SIMULE. We further evaluated the resulting TF interactions using the popular “functional enrichment” analysis with DAVID (Da Wei Huang and Lempicki 2008) and found that SIMULE and SIMULEI can reveal known functional sets and potentially novel interactions that drive leukemia. This leads us to believe that our approach can be used in a wider range of applications as well.
Many domain-specific studies have applied sGGM to real-world datasets, especially those from molecular biology or brain science. For example, Ma et al. (2007) estimate gene networks using microarray data in the model species Arabidopsis. Krumsiek et al. (2011) explore GGM to reconstruct pathway reactions from high-throughput metabolomics data. Within brain science, a few studies (Ng et al. 2013; Huang et al. 2010; Monti et al. 2015; Sun et al. 2009) have tried to learn the brain connectivity of Alzheimer’s disease through sparse inverse covariance estimation. Due to space limits, we omit a detailed discussion of these studies.
7 Proof of Theorems
In this section, I(S) denotes the indicator function of the set S.
7.1 Lemma 2
Proof
In the following, we only show the case of \(1 \le j \le p\). When \(j > p\), the same argument holds with \(e_{j \bmod p}\) in place of \(e_j\). The condition \(\Vert \varSigma _{tot}\hat{\varOmega }_{tot}^1 - I\Vert _{\infty } \le \lambda _n \) is equivalent to \(\Vert \varSigma _{tot}\hat{\omega }_j^1 - e_j\Vert _{\infty } \le \lambda _n\) for \(1 \le j \le p\).
7.2 Theorem 3
Proof
Finally, Inequality (22) can be derived from Inequality (20), Inequality (36) and the inequality relationship \(\Vert \mathbf {A}\Vert ^2_{F} \le p\Vert \mathbf {A}\Vert _1\Vert \mathbf {A}\Vert _{\infty }\). \(\square \)
7.3 Theorem 4
Proof
Theorem 4(a)
Proof
Theorem 4(b)
Let \(\bar{\mathbf {Y}}_{kij} = \mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\} - {\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\}\), \(\check{\mathbf {Y}}_{kij} = \mathbf {Y}_{kij} - \bar{\mathbf {Y}}_{kij}\).
Since
7.4 Theorem 7
Proof
8 Conclusions
This paper introduces a novel method, SIMULE, for learning shared and distinct patterns simultaneously when inferring multiple sGGMs or sNGMs jointly. Through the \(\ell \)1-constrained formulation, our solution is efficient and can be parallelized. We successfully apply SIMULE to four synthetic datasets and two real-world datasets. We prove that SIMULE has a favorable convergence property and justify the benefit of multi-tasking. Future work will extend SIMULE to model more complex relationships among contexts.
Footnotes
 1. Following convention, \(\varSigma \), \(\varOmega \), \(\varPhi \), \(\beta \), \(\mu \), \(\theta \) and I are not bold.
 2.
 3. We could not find an implementation of CSSLGGM and therefore cannot include it as a baseline.
 4. This assumes that more interactions are preserved across cell contexts, partly due to concerns of system or evolutionary stability.
 5. We extend JGL and SIMONE to nonparanormal distributions by replacing the sample covariance matrix with the Kendall’s tau correlation matrix in their R implementations.
 6. It is possible to combine the nonparanormal and intertwined strategies to extend SIMULE. We leave this as future work.
 7. We revise the R packages of three baselines to extend them to the nonparanormal distribution. This is to provide a fairer comparison of these baselines versus NSIMULE.
 8. The multi-core setting we use for this timing experiment is meant to reflect the distributed parallel nature of SIMULE. We leave the topic of using multi-threading to improve SIMULE as future research.
 9. When \(p \ge 400\), SIMONEI takes more than 5 days to train. This is why no data points are shown in Fig. 5b for such cases.
 10. NSIMULE was tried as well and achieved the same validation result as SIMULE.
 11. We would like to point out that the interactions SIMULE finds represent statistical dependencies between genes that vary across multiple cell types. There exist many possibilities for such interactions, including physical protein-protein interactions, regulatory gene pairs or signaling relationships. Therefore, we combine multiple existing databases for a joint validation.
Notes
Acknowledgements
This work was supported by the National Science Foundation under NSF CAREER Award No. 1453580. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.
References
 Antoniadis, A., & Fan, J. (2011). Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455), 939–967.
 Banerjee, O., El Ghaoui, L., & d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485–516.
 Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
 Buchman, D., Schmidt, M., Mohamed, S., Poole, D., & de Freitas, N. (2012). On sparse, spectral and other parameterizations of binary probabilistic models. In AISTATS (pp. 173–181).
 Cai, T., Liu, W., & Luo, X. (2011). A constrained \(\ell \)1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607.
 Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.
 Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
 Cheng, C., Yan, K. K., Hwang, W., Qian, J., Bhardwaj, N., Rozowsky, J., et al. (2011). Construction and analysis of an integrated regulatory network derived from high-throughput sequencing data. PLoS Computational Biology, 7(11), e1002190.
 Chiquet, J., Grandvalet, Y., & Ambroise, C. (2011). Inferring multiple graphical structures. Statistics and Computing, 21(4), 537–553.
 Da Wei Huang, B. T. S., & Lempicki, R. A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1), 44–57.
 Danaher, P., Wang, P., & Witten, D. M. (2013). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 373–397.
 Di Martino, A., Yan, C. G., Li, Q., Denio, E., Castellanos, F. X., Alaerts, K., et al. (2014). The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry, 19(6), 659–667.
 ENCODE Project Consortium. (2011). A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology, 9(4), e1001046.
 ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74.
 Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117). ACM.
 Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review. doi: 10.1093/nsr/nwt032.
 Fan, J., Liao, Y., & Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4), 603–680.
 Fazayeli, F., & Banerjee, A. (2016). Generalized direct change estimation in Ising model structure. arXiv preprint arXiv:1606.05302.
 Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.
 Guo, J., Levina, E., Michailidis, G., & Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika. doi: 10.1093/biomet/asq060.
 Han, F., Liu, H., & Caffo, B. (2013). Sparse median graphs estimation in a high dimensional semiparametric model. arXiv preprint arXiv:1310.3223.
 Hara, S., & Washio, T. (2013). Learning a common substructure of multiple graphical Gaussian models. Neural Networks, 38, 23–38.
 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Berlin: Springer.
 Höfling, H., & Tibshirani, R. (2009). Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. The Journal of Machine Learning Research, 10, 883–906.
 Honorio, J., & Samaras, D. (2010). Multi-task learning of Gaussian graphical models. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 447–454).
 Hsieh, C. J., Sustik, M. A., Dhillon, I. S., & Ravikumar, P. D. (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS (pp. 2330–2338).
 Huang, S., Li, J., Sun, L., Ye, J., Fleisher, A., Wu, T., et al. (2010). Learning brain connectivity of Alzheimer’s disease by sparse inverse covariance estimation. NeuroImage, 50(3), 935–949.
 Ideker, T., & Krogan, N. J. (2012). Differential network biology. Molecular Systems Biology, 8(1), 565.
 Kelly, C., Biswal, B. B., Craddock, R. C., Castellanos, F. X., & Milham, M. P. (2012). Characterizing variation in the functional connectome: Promise and pitfalls. Trends in Cognitive Sciences, 16(3), 181–188.
 Kolar, M., Song, L., Ahmed, A., Xing, E. P., et al. (2010). Estimating timevarying networks. The Annals of Applied Statistics, 4(1), 94–123.MathSciNetCrossRefzbMATHGoogle Scholar
 Krumsiek, J., Suhre, K., Illig, T., Adamski, J., & Theis, F. J. (2011). Gaussian graphical modeling reconstructs pathway reactions from highthroughput metabolomics data. BMC Systems Biology, 5(1), 21.CrossRefGoogle Scholar
 Lam, C., & Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B), 4254.MathSciNetCrossRefzbMATHGoogle Scholar
 Lauritzen, S. L. (1996). Graphical models. Oxford: Oxford University Press.zbMATHGoogle Scholar
 Levina, E., Rothman, A., Zhu, J., et al. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. The Annals of Applied Statistics, 2(1), 245–263.
 Liu, H., Han, F., & Zhang, C. (2012). Transelliptical graphical models. In Advances in Neural Information Processing Systems (pp. 809–817).
 Liu, H., Lafferty, J., & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research, 10, 2295–2328.
 Liu, H., Wang, L., & Zhao, T. (2014). Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics, 23(2), 439–459.
 Liu, S., Quinn, J. A., Gutmann, M. U., & Sugiyama, M. (2013). Direct learning of sparse changes in Markov networks by density ratio estimation. In Joint European conference on machine learning and knowledge discovery in databases (pp. 596–611). Springer.
 Ma, S., Gong, Q., & Bohnert, H. J. (2007). An Arabidopsis gene network based on the graphical Gaussian model. Genome Research, 17(11), 1614–1625.
 Mardia, K. V., Kent, J. T., & Bibby, J. M. (1980). Multivariate analysis. London: Academic Press.
 McCall, M. N., Uppal, K., Jaffee, H. A., Zilliox, M. J., & Irizarry, R. A. (2011). The gene expression barcode: Leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Research, 39(suppl 1), D1011–D1015.
 Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3), 1436–1462.
 Min, M. R., Ning, X., Cheng, C., & Gerstein, M. (2014). Interpretable sparse high-order Boltzmann machines. In Proceedings of the seventeenth international conference on artificial intelligence and statistics (pp. 614–622).
 Mohan, K., London, P., Fazel, M., Lee, S. I., & Witten, D. (2013). Node-based learning of multiple Gaussian graphical models. arXiv preprint arXiv:1303.5145.
 Monti, R. P., Anagnostopoulos, C., & Montana, G. (2015). Learning population and subject-specific brain connectivity networks via mixed neighborhood selection. arXiv preprint arXiv:1512.01947.
 Negahban, S., Yu, B., Wainwright, M. J., & Ravikumar, P. K. (2009). A unified framework for high-dimensional analysis of \(M\)-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems (pp. 1348–1356).
 Ng, B., Varoquaux, G., Poline, J. B., & Thirion, B. (2013). A novel sparse group Gaussian graphical model for functional connectivity estimation. In Information processing in medical imaging (pp. 256–267). Springer.
 Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., BroackesCarter, F., et al. (2013). The MIntAct project IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research. doi:10.1093/nar/gkt1115.
 Pang, H., Liu, H., & Vanderbei, R. (2014). The fastclime package for linear programming and large-scale precision matrix estimation in R. Journal of Machine Learning Research, 15, 489–493.
 Prasad, T. K., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., et al. (2009). Human protein reference database 2009 update. Nucleic Acids Research, 37(suppl 1), D767–D772.
 Qiu, H., Han, F., Liu, H., & Caffo, B. (2013). Joint estimation of multiple graphical models from high dimensional time series. arXiv preprint arXiv:1311.0219.
 Ripley, B. D. (2009). Stochastic simulation (Vol. 316). London: Wiley.
 Rothman, A. J. (2012). Positive definite estimators of large covariance matrices. Biometrika, 99(3), 733–740.
 Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J., et al. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494–515.
 Schmidt, M., & Murphy, K. (2010). Convex structure learning in log-linear models: Beyond pairwise potentials. In Proceedings of the international conference on artificial intelligence and statistics (AISTATS).
 Stark, C., Breitkreutz, B. J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M. (2006). BioGRID: A general repository for interaction datasets. Nucleic Acids Research, 34(suppl 1), D535–D539.
 Sugiyama, M., Kanamori, T., Suzuki, T., du Plessis, M. C., Liu, S., & Takeuchi, I. (2013). Density-difference estimation. Neural Computation, 25(10), 2734–2775.
 Sun, L., Patel, R., Liu, J., Chen, K., Wu, T., Li, J., et al. (2009). Mining brain region connectivity for Alzheimer’s disease study via sparse inverse covariance estimation. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1335–1344). ACM.
 The Cancer Genome Atlas Research Network. (2011). Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353), 609–615.
 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
 Wainwright, M. J., & Jordan, M. I. (2006). Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6), 2099–2109.
 Yang, E., Lozano, A. C., & Ravikumar, P. K. (2014). Elementary estimators for graphical models. In Advances in neural information processing systems (pp. 2159–2167).
 Yuan, M., & Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19–35.
 Zhang, B., & Wang, Y. (2012). Learning structural changes of Gaussian graphical models in controlled experiments. arXiv preprint arXiv:1203.3532.
 Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
 Zhang, Y., & Schneider, J. G. (2010). Learning multiple tasks with a sparse matrix-normal penalty. In Advances in neural information processing systems (pp. 2550–2558).
 Zhu, Y., Shen, X., & Pan, W. (2014). Structural pursuit over multiple undirected graphs. Journal of the American Statistical Association, 109(508), 1683–1696.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.