Abstract
Identifying contextspecific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging applications. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse undirected graphical models (UGM) from aggregated samples across several contexts. Previous jointUGM studies have mostly focused on sparse Gaussian graphical models (sGGMs) and can’t identify contextspecific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly) to learn multiUGM via a constrained \(\ell \)1 minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the \(\ell \)1 constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate Nonparanormal data that greatly relaxes the normality assumption that many realworld applications do not follow. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate \(O(\log (Kp)/n_{tot})\). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over stateoftheart multisGGM and singleUGM baselines (SIMULE implementation and the used datasets @https://github.com/QData/SIMULE).
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Undirected graphical models (UGMs) provide a powerful tool for understanding statistical relationships among random variables. In a typical setting, we can represent the conditional dependency patterns among p random variables \(\{ X_1,\ldots ,X_p \}\) using an undirected graph \(G=(V,E)\). V includes p nodes corresponding to the p variables. E denotes the set of edges describing conditional dependencies among the variables \(\{X_1, \ldots , X_p\}\). If a pair of random variables is conditionally dependent given the rest of variables, there exists an edge in E connecting the corresponding pair of nodes in V; otherwise, the edge is absent. Within the graphical model framework, the task of estimating such undirected graphs based on a set of observed data samples is called structure estimation or model selection. Much of the related literature has focused on estimating G from a given data matrix \(\mathbf {X}_{n\times p}\) (with n observations across p random variables) that are independently and identically drawn from \(N_p(\mu , \varSigma )\). Here \(N_p(\mu , \varSigma )\) represents a multivariate Gaussian distribution with mean vector \(\mu \) (\(\mu \in \mathbb {R}^p\)) and covariance matrix \(\varSigma \) (with size \(p \times p\)). Using the aforementioned G to describe pairwise dependencies among p variables for such a multivariate Gaussian distribution is called a Gaussian graphical model (GGM, e.g., Lauritzen 1996; Mardia et al. 1980). The inverse of the covariance matrix is called the precision matrix, \(\varOmega := \varSigma ^{1}\). Interestingly, GGM’s conditional independence pattern corresponds to zeros of \(\varOmega \). This means that an edge does not connect ith node and jth node (i.e., conditionally independent) in GGM if and only if \(\varOmega _{ij} = 0\).
To achieve a consistent estimation of G, assumptions are usually imposed on the structure of \(\varOmega \). Most commonly, the graph sparsity assumption has been introduced by various estimators to derive sparse GGM (sGGM). The graph sparsity assumption corresponds to limiting the number of nonzero entries in the precision matrix \(\varOmega \), which leads to a combinational problem for structure estimation. Many classic GGM estimators use the \(\ell _1\)norm to create convex relaxation of the combinatorial formulation. For instance, the popular estimator “graphical lasso” (GLasso) has considered maximizing a \(\ell _1\)penalized loglikelihood objective (Yuan and Lin 2007; Banerjee et al. 2008; Hastie et al. 2009; Rothman et al. 2008). More recently, Cai et al. (2011) proposed a constrained \(\ell _1\)minimization formulation for estimating \(\varOmega \), known as the CLIME estimator. CLIME can be solved through columnwise linear programming and has shown favorable theoretical properties. Moreover, the nonparanormal graphical models (NGM) recently proposed by Liu et al. (2012) have extended sGGM to new distribution families. Both GGM and NGM belong to the general family of UGM (reviewed in Sect. 4).
This paper focuses on the problem of jointly estimating K undirected graphical models from K related multivariate sample blocks. Each sample block contains a different set of data observations on the same set of variables. This task is motivated by the fact that the past decade has seen a revolution in collecting largescale heterogeneous data from many scientific fields like genetics and brain science. For instance, genomic technologies have delivered fast and accurate molecular profiling data across many cellular contexts (e.g. cell lines or cell stages) (ENCODE Project Consortium 2011). Many neuroimaging studies have collected functional measurements of brain regions across a cohort of multiple human subjects (Di Martino et al. 2014). Such networks can be concerned with identifying subjectspecific variations across a population, where each individual is a unique context. For this type of data, understanding and quantifying contextspecific variations across multiple graphs is a fundamental analysis task. Figure 1 provides a simple illustration (with two contexts) of the target problem. Interaction patterns that are activated only under a specific context can help to understand or to predict the importance of such a context (Ideker and Krogan 2012; Kelly et al. 2012).
Prior approaches for estimating UGMs from such heterogeneous data tend to either only estimate pairwise differential patterns between two graphs or jointly estimate multiple sGGMs toward a common graph pattern (reviewed in Sect. 4). The former strategy does not exploit the shared network structure across contexts and is not applicable for more than two contexts, leading to undesirable effects on the quality of the estimated networks. Conversely, the latter approach underestimates the network variability and makes implicit assumptions to minimize intercontext differences which are difficult to justify in practice. This is partly caused by the fact that relevant studies have mostly extended the “graphical lasso” (GLasso) estimator to multitask settings and followed a penalized loglikelihood formulation [Eq. (2)]. Under the GLasso framework, however, explicitly quantifying the contextspecific substructures involves a very challenging optimization task (explained in detail in Sect. 4).
This paper proposes a novel approach that uses a constrained \(\ell \)1minimization formulation for joint structure learning of multiple sparse GGMs or NGMs. We name the method SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly), and include the following contributions:

Novel model Using a constrained \(\ell \)1 optimization strategy (Sect. 2), SIMULE extends CLIME to a multitask setting. The learning step is solved efficiently through a formulation of multiple independent subproblems of linear programming (Sect. 2.4) for which we also provide a parallel version of the learning algorithm. Compared with previous multitask sGGM models, SIMULE can accurately quantify taskspecific network variations that are unique for each task. This also leads to a better generalization and benefits all the involved tasks.

Novel extension Furthermore, since most realworld datasets do not follow the normality assumption, we extend SIMULE to Nonparanormal SIMULE (NSIMULE in Sect. 3) by learning multiple NGM under the proposed \(\ell \)1 constrained minimization. NSIMULE can deal with nonGaussian data that follow the nonparanormal distribution (Liu et al. 2009), a much more generic data distribution family. Fitting NSIMULE is computationally as efficient as SIMULE.

Theoretical convergence rate In Sect. 5, we theoretically prove that SIMULE and NSIMULE achieve a consistent estimation of the target (true) dependency graphs with a high probability at the rate \(O(\log (Kp)/n_{tot})\). Here \(n_{tot}\) represents the total number of samples from all tasks and K describes the number of tasks. This proof also theoretically validates the benefit of learning multiple sGGMs jointly (Sect. 7), since the \(O(\log (Kp)/n_{tot})\) convergence rate is better than learning multiple singlesGGMs separately at rate \(O(\log p/n_i)\). \(n_{i}\) represents the number of samples of ith task. Such an analysis hasn’t been provided in any of the previous multisGGM studies.

Performance improvement In Sect. 6 we show a strong improved performance of SIMULE and NSIMULE over multiple baseline methods on multiple synthetic datasets and two realworld multicell biomedical datasets. The proposed methods obtain better AUC and partial AUC scores on all simulated cases. On two realworld datasets, our methods find the most matches of variable interactions when using existing BioMeddatabases for validation.
2 Method: SIMULE
Learning multiple UGM jointly is a task of interests in many applications. This paper tries to model and learn contextspecific graph variations explicitly, because such variations can “fingerprint” important markers for fields like cognition (Ideker and Krogan 2012), physiology (Kelly et al. 2012) or pathology (Ideker and Krogan 2012; Kelly et al. 2012). We consider the general case of estimating K graphical models from a pdimensional aggregated dataset in the form of K different data blocks.
In what follows, plain letters denote scalars. Uppercase and lowercase bold letters denote matrices and vectors respectively.^{Footnote 1} We denote \(\mathbf {X}^{(i)}_{n_i \times p}\) as the ith data block (or data matrix). The total number of data samples uses notation \(n_{tot} =\sum \nolimits _{i=1}^{K}n_i\). The precision matrix uses notation \(\varOmega \) and the covariance matrix uses notation \(\varSigma \). We denote the correlation matrix as S and the inverse of correlation matrix as \(\varPhi \). The vector \(\mathbf {e}_j = (0,\ldots ,1,\ldots ,0)^T\) denotes a basis vector in which only the jth entry is 1 and the rest are 0. For a pdimensional data vector \(\mathbf {x}= (x_1, x_2, \ldots , x_p)^T \in \mathbb {R}^p\), let \(\mathbf {x}_1 = \sum \nolimits _ix_i\) be the \(\ell _1\)norm of \(\mathbf {x}\), \(\mathbf {x}_{\infty } = \max \nolimits _ix_i\) be the \(\ell _{\infty }\)norm of \(\mathbf {x}\) and \(\mathbf {x}_2 = \sqrt{\sum \nolimits _i x_i^2}\) be the \(\ell _2\)norm of\(\mathbf {x}\). Similarly, for a matrix \(\mathbf {X}\), let \(\mathbf {X}_1 = \sum \nolimits _{i,j}\mathbf {X}_{i,j}\) be the \(\ell _1\)norm of \(\mathbf {X}\) and \(\mathbf {X}_{\infty } = \max \nolimits _{i,j}\mathbf {X}_{i,j}\) be the \(\ell _{\infty }\)norm of \(\mathbf {X}\). \(\mathbf {X}_2 = \sqrt{\lambda _{\max }(\mathbf {X})}\), here \(\lambda _{\max }\) is the largest eigenvalue of \(\mathbf {X}\). \(\mathbf {X}_F = \sqrt{\sum \nolimits _{i,j}\mathbf {X}_{i,j}^2}\) is the Fnorm of \(\mathbf {X}\). \(\mathbf {X}_\mathbf{1 } = \max \nolimits _{j} \sum \nolimits _i \mathbf {X}_{ij}\) is the matrix \(\mathbf 1 \)norm of \(\mathbf {X}\). \(\mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)}_{1,p} = \sum \nolimits _{i}\mathbf {X}^{(i)}_p\) is the \(\ell _{1,p}\)norm of \((\mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)})\). \(\varOmega \succ 0\) means that \(\varOmega \) is a positive definite matrix. \((\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)})_{\mathcal {G},2} = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p (\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(i)}, \ldots , \varOmega _{j,k}^{(K)})_2\). \((\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)})_{\mathcal {G},\infty } = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p (\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(i)}, \ldots , \varOmega _{j,k}^{(K)})_\infty \).
2.1 Background: CLIME for estimating sparse Gaussian graphical model
The CLIME estimator, short for constrained \(\ell _1\) minimization method for inverse matrix estimation (Cai et al. 2011), uses an \(\ell _1\) constrained optimization [as Eq. (1)] to estimate the precision matrix \(\varOmega \):
Here \(\lambda > 0\) is the tuning parameter.
The idea comes from taking the first derivative of the objective function of “graphical Lasso” (GLasso) (Yuan and Lin 2007; Friedman et al. 2008). The original objective function (an \(\ell _1\)penalized loglikelihood) is as follows,
By taking the first derivative of Eq. (2) and setting it equal to zero, the solution \(\widehat{\varOmega }_{glasso}\) also satisfies:
where \(\widehat{Z}\) is an element of the subdifferential \( \partial \widehat{\varOmega }_{glasso}_1\). This leads to the CLIME estimator in Eq. (1). One of the best properties of CLIME is that it can be solved column by column separately. Suppose \(\beta \) is one of the column vectors in the precision matrix \(\varOmega \). Instead of estimating the entire \(\varOmega \) all at once as in Eq. (1), we can estimate each column \(\beta \) of \(\varOmega \) as follows:
After we obtain an estimated \(\widehat{\varOmega }^{1}\) from Eq. (1), to maintain the symmetric property of the estimator, CLIME then uses the following operation,
Cai et al. (2011) has proved that CLIME achieves a consistent estimation of the true graph with a high probability at the rate \(O(\log p/n)\).
2.2 Background: multitask learning with taskshared and taskspecific parameters
Multitask learning (MTL) was initially proposed by Caruana (1997) to find common feature weights across multiple relevant tasks. If different tasks are sufficiently related, MTL can lead to a better generalization across all tasks. The specific MTL formulation we explore is suggested by MTSVM (Evgeniou and Pontil 2004), which models multiple support vector machines (SVMs) through taskjoint and taskspecific parameters. When given K related tasks in which each sample \((\varvec{x}_i, y_i)\) belongs to exactly one of the tasks \(\{1,\ldots ,K\}\), MTSVM (Evgeniou and Pontil 2004) learns K distinct parameters \(\varvec{w}_1,\ldots ,\varvec{w}_K\) where each \(\varvec{w}_k\) is specifically dedicated for the task k. In addition, MTSVM utilizes a global parameter \(\varvec{w}_0\) to capture the commonality among all the tasks. Each sample \(\varvec{x}\) is classified by \(\hat{y}_i = \text {sign} (\varvec{x}^T_i ( \varvec{w}_0 + \varvec{w}_k))\). Clearly, the vectors \(\varvec{w}_k\) are “small” when the tasks are similar to each other. In other words, we assume that the tasks are related in such a way that the true models are all close to some (shared) model \(\varvec{w}_0\). MTSVM estimates all \(\varvec{w}_1,\ldots ,\varvec{w}_K\) and \(\varvec{w}_0\) simultaneously.
2.3 SIMULE: infer Shared and Individual parts of MULtiple sGGM Explicitly
Treating sparse GGM estimation from each data block as a single task, our main goal is to learn multiple sGGMs over K tasks jointly, which can lead to a better generalization across all of the involved tasks (theoretically proven in Sect. 5).
Mirroring the strategy used by previous jointestimators of multiple GLasso (described in Sect. 4), we achieve “multitasking” by summing up the CLIME estimators from each of the K tasks.
Here \(n_{tot} =\sum \nolimits _{i=1}^{K}n_i\) represents the total number of data samples used in all tasks. It is worth mentioning that we select the hyperparameter by considering the sample size, i.e., using \(\lambda _{n_{tot}}\) instead of \(\lambda \) (based on two recent studies (Negahban et al. 2009; Yang et al. 2014), details in Sect. 6). To simplify notations, we use \(\lambda _n\) instead of \(\lambda _{n_{tot}}\) in the rest of this section.
Then following the MTL formulation of MTSVM, we simply model each sGGM network as
where \(\varOmega _S\) is the shared pattern among all graphs and \(\varOmega _I^{(i)}\) represents the individual part specific for ith graph (see Fig. 1 for a simple illustration). Furthermore, we assume both the shared and the individual parts of each \(\varOmega ^{(i)}\) should be sparse and convert this assumption into the following formulation,
Here we assume \(\varepsilon > 0\). This hyperparameter is chosen according to applications. It reflects the difference of sparsity level between the shared part and the contextspecific parts. For example, due to evolutionary stability and/or system stability, we usually assume that the shared part of gene networks is more dense than individual subgraphs of each cell context. Accordingly, Sect. 6 uses \(\varepsilon < 1\) for two realworld molecular expression datasets.
Now we replace the objective in Eq. (4) with the new objective from Eq. (6) and plug Eq. (5) into the constraints of Eq. (4). This gives us the following novel formulation of SIMULE:
In Sect. 5, we theoretically prove that those \(\{ \varOmega ^{(i)}i = 1,\ldots ,k\}\) estimated from Eq. (7) are positive definite and converge to the true precision matrices with a high probability.
Equation (7) can be solved column by column without influencing the resulting solution. For each column, we solve the following optimization:
\(\beta ^{(i)}\) is one of the column vectors in the individual part \(\varOmega ^{(i)}_I\) of ith graph (we take out the subscript I in \(\beta ^{(i)}\) to simplify notations). \({\beta ^s}\) is the corresponding column in the shared part \(\varOmega _S\).
Equivalently, SIMULE solves the following optimization, for each column:
2.4 Optimization
By simplifying notations, Eq. (9) can be rewritten as
Through relaxation, we convert Eq. (10) to the following linear programming formulation in Eq. (11),
Here a set of \(\mathbf {u}_j\) are the slack variables, \(\mathbf {A}_{k,}^{(i)}\) means the kth row of \(\mathbf {A}^{(i)}\) and \(\varvec{b}_k\) is the kth entry of b. The pseudo code of SIMULE is summarized in Algorithm 1. Following CLIME, then we apply the same symmetric operators on \(\{\varOmega ^{(i)} = \varOmega _S + \varOmega _I^{(i)} \}\) obtained from Algorithm 1. Section 5 proves that \(\varOmega ^{(i)}\) converges to the true one with a optimal convergence rate.
To solve Eq. (11), we follow the primal dual interior method (Boyd and Vandenberghe 2004) that has also been used in the Dantzig selector for the task of regression (Candes and Tao 2007). Other strategies can be used to solve this linear programming, such as the one used in Pang et al. (2014).
2.5 Parallel version of SIMULE
Algorithm 1 can easily be revised into a parallel version. Essentially we just need to revise the “For loop” of step 8 in Algorithm 1 into, for instance, “column per machine” or “column per core”. Since the calculation of each column is independent from the other columns, this parallel variation will obtain the same solution as Algorithm 1 at a better speed. Section 6 shows the speed improvements of SIMULE in a multicore setting versus a singlecore setting.
3 Method variation: nonparanormal SIMULE (NSIMULE)
Though sGGM is powerful, its normality assumption is commonly violated in real applications. For instance, for the TF ChIPSeq data analyzed in Sect. 6.6, the histogram of one of its TF variables is clearly not following Gaussian distribution (across samples, shown as the right distribution graph in Fig. 2). After a univariate logtransformation of the same feature, we obtain its distribution histogram as the left graph in Fig. 2. The transformed data samples are approximately normally distributed. This motivates us to adopt a more generalized UGM (recently proposed in Liu et al. 2009) to overcome the limitation of sGGM. This socalled “nonparanormal graphical model” (Liu et al. 2009) assumes that data samples follow a multivariate nonparanormal distribution, which is a strict superset of the Gaussian distribution. We extend SIMULE to the nonparanormal family and name this novel variation NSIMULE. NSIMULE learns to fit multiple NGMs jointly through modeling taskshared and taskspecific parameters explicitly.
3.1 Background: nonparanormal graphical model
A random variable \(Z = (Z_{1}, \ldots , Z_{p})^{T}\) is said to follow a nonparanormal distribution
if and only if there exists a set of univariate strictly increasing transformations \(f = \{f_{j}\}_{j=1}^{p}\) such that:
Figure 2 shows a simple example of NGM (inside the rightmost subfigure) and its unobserved latent Gaussian graphical model (inside the leftmost subfigure). Assume that we are given a dataset including n observations that are independently and identically drawn from \(NPN_{p}(\mu ,\mathbf {S};f_{1},\ldots ,f_{p})\), a multivariate nonparanormal distribution. The conditional independence graph G among \(Z_i\) variables can be modeled with a corresponding NGM (Liu et al. 2009). The graph structure of NGM is encoded through the sparsity pattern of the inverse covariance matrix \(\varPhi := \mathbf {S}^{1}\), where \(\mathbf {S}\) denotes the covariance matrix of \(NPN_{p}\).
3.2 Background: estimate \(\mathbf {S}\) through rankbased measures of correlation matrix \(\mathbf {S}_0\)
Since the direct estimation of covariance matrix \(\mathbf {S}\) is difficult in nonparanormal distribution, recent studies have proposed an efficient nonparametric estimator (Liu et al. 2009) for \(\mathbf {S}\). This estimator is derived from the correlation matrix \(\mathbf {S}_0\). Because the covariance matrix \(\mathbf {S}= diag(\mathbf {S}_i) \mathbf {S}_0 diag(\mathbf {S}_i)\), \(\mathbf {S}^{1} = diag(\mathbf {S}_i)^{1} \mathbf {S}_0^{1} diag(\mathbf {S}_i)^{1}\). Here \(\mathbf {S}_i = \sqrt{Cov(Z_i,Z_i)}\) and \(diag(\mathbf {S}_i) = diag(\mathbf {S}_1,\mathbf {S}_2,\ldots ,\mathbf {S}_p)\). Therefore, the inverse of correlation matrix (\(\mathbf {S}_0^{1}\)) and the inverse of covariance matrix (\(\mathbf {S}^{1}\)) have the same nonzero and zero entries. Based on this observation, Liu et al. (2009) proposed a nonparametric method to estimate the correlation matrix \(\mathbf {S}_0\), instead of estimating the covariance matrix \(\mathbf {S}\) for the purpose of structure inference.
In Liu et al. (2009) the authors proposed using the population Kendall’s tau correlation coefficients \(\tau _{jk}\) to estimate \(\mathbf {S}_0\), based upon the explicit relationship between this rankbased measure \(\tau _{jk}\) and the correlation measure \((\mathbf {S}_{jk})_0\) for a given nonparanormal dataset \(Z \sim NPN_{p}(\mu , \mathbf {S},f_{1},\ldots ,f_{p})\) (discussed in Liu et al. 2012). Figure 2 presents the simple relationship between \(Z \sim NPN_{p}(\mu ,S;f_{1},\ldots ,f_{p})\) and its latent \(X \sim N(\mu ,S)\). To simplify notations, we use \(\mathbf {S}\) to represent the correlation matrix for the remainder of this paper.
Theorem 1
Given \(Z \sim NPN_{p}(\mu , S, f_{1},\ldots ,f_{p})\) , a nonparanormal distribution, we have that
where the Kendall’s tau can be estimated as:
Proof
The proof is provided in Liu et al. (2009). \(\square \)
Therefore, the correlation matrix \(\mathbf {S}\) can be estimated as:
We can then plug in the estimated \(\widehat{\mathbf {S}}\) for learning the dependency graph structure in the corresponding NGM.
3.3 NSIMULE: nonparanormal SIMULE
We can now substitute each sample covariance matrix \(\hat{\varSigma }^{(i)}\) used in Eq. (9) from each task with its corresponding correlation matrix \(\mathbf {S}^{(i)}\) as estimated above. The rest of the computations are the same as SIMULE. We refer to this whole process as NSIMULE. It estimates multiple different, but related, sparse Nonparametric Graphical Models (sNGM) through shared and taskspecific parameter representations.
Theorem 2
If X, Y are two independent random variables and f,g \(:\mathbb {R}\rightarrow \mathbb {R}\) are two measurable functions, then f(X) and g(Y) are also independent.
Through the above theorem, the monotone functions f in \(NPN_{p}\) will not change the conditional dependency among variables. As proved in Liu et al. (2009), the conditional dependency network among the latent Gaussian variables X (in this \(NPN_{p}\)) is the same as the conditional dependency network among the nonparanormal variables \(Z_i\), with a parametric asymptotic convergence rate. Therefore, we can use the estimated correlation matrices \(\mathbf {S}^{(i)}\) for the joint network inference of multiple sNGMs in SIMULE. This is also because we have shown that the inverse of the correlation matrix and the inverse of the covariance matrix share the same nonzero and zero patterns.
4 Related work
4.1 Connecting to past multisGGM studies
Sparse GGM is an extremely active topic in the recent literature including notable studies like Wainwright and Jordan (2006) and Banerjee et al. (2008). We can categories singletask sGGM estimators into three groups: (a) penalized likelihood (GLasso), (b) neighborhood approach and (c) CLIME estimator.
Most previous methods that estimate multiple sGGMs jointly (on the same set of variables from aggregated data samples) can be formulated as:
where \(\varOmega ^{(i)}\) denotes the precision matrix for the ith task. \(L(\cdot )\) represents loglikelihood or pseudolikelihood function. \(\sum \nolimits _i \varOmega ^{(i)}_1\) adds sparsity constraints on each task. \( P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\) enforces certain joint properties among the tasks (like using group sparsity to enforce similarity among tasks).
We choose the most relevant three studies as our baselines in the experiments: (a) Fused Joint graphical lasso (JGLfused) (Danaher et al. 2013), (b) Group Joint graphical lasso (JGLgroup) (Danaher et al. 2013) and (c) SIMONE (Chiquet et al. 2011). JGLfused and JGLgroup are based on the popular “graphical lasso” estimator (Friedman et al. 2008; Yuan and Lin 2007); [using \(L(\varOmega ) =(\log det(\varOmega )  <\varSigma , \varOmega >)\) in Eq. (13)]. SIMONE (Chiquet et al. 2011) follows neighborhoodselection based estimator. It can be viewed as using a pseudolikelihood approximation instead of the full likelihood as \(L(\varOmega )\) in Eq. (13).^{Footnote 2}
The general purpose of the penalty function \( P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\) in Eq. (13) is to push the inference of multiple graphs toward a common pattern. Table 1 provides a representative list of penalty functions that have been used previously for the multisGGM setting. For example, JGL uses the fused norm (the 1st row of Table 1) to penalize the difference between two graphs with resulting variation named as JGLfused. JGLgroup uses a \(\{ \mathcal {G},2 \}\) norm that pushes multiple graphs to have the same sparsity patterns (the 2nd row of Table 1). SIMONE provides a novel penalty proposed by the authors (shown as in the 3rd row of Table 1) to enforce similar sparsity pattern on multiple graphs.
In addition to these three baselines, a number of recent studies also perform multitask learning of sGGM (Honorio and Samaras 2010; Guo et al. 2011; Zhang and Wang 2012; Zhang and Schneider 2010; Zhu et al. 2014). They all follow the same formulation as Eq. (13) but explore a different second penaltyfunction \(P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\). As an example, Nodebased JGL proposed a novel penalty, namely RCON (Mohan et al. 2013) (shown as the 4th row of the Table 1) or “rowcolumn overlap norm” for capturing special relationship among graphs. In two recent works, the penalty function at the 5th row is Table 1 has been used by Guo et al. (2011) and the penalty function at the 6th row of Table 1 has been used by Honorio and Samaras (2010).
Furthermore, there exist studies that explore similar motivations as ours when learning multiple GGM models from data. (a) Han et al. (2013) proposed to estimate a population graph from multiblock data using a socalled “mediangraph” idea. It is conceptually similar to \(\varOmega _S\). However, they do not have \(\varOmega ^{(i)}_I\) to model individual parts that are specific to each task. (b) Another recent study, CSSLGGM Hara and Washio (2013) also tried to model both the shared and individual substructures in multisGGMs. Different from ours, their formulation is within the penalized likelihood framework as Eq. (13). They used \(\ell \)1,p norm (see last row of Table 1) to regularize the taskspecific parts, while SIMULE uses \(\ell \)1 norm instead in Eq. (6). The \(\ell \)1,p norm pushes the individual parts of multiple graphs to be similar which is contradictory to the original purpose of these parameters.^{Footnote 3} (c) More recently, Monti et al. (2015) proposed to learn population and subjectspecific brain connectivity networks via a socalled “Mixed Neighborhood Selection” (MSN) method. Following the neighborhood selection framework (Meinshausen and Bühlmann 2006), for each node v, MSN tried to learn the neighborhood of each v. Similar to SIMULE, they estimated the neighborhood edges of a given node v in the itask as \(\beta ^v + \widetilde{\varvec{b}}^{(i),v}\). Here \(\beta ^v\) represents the neighbor in the shared part and \(\widetilde{\varvec{b}}^{(i),v}\) represents the neighbors that are specific to the ith graph. Since MSN is specially designed for brain imaging data, it assumes each individual graph is generated by random effects, i.e., \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\). SIMULE does not have such strong assumptions on either taskspecific or taskshared substructures. Our model is more general while MSN is designed for brain imaging data. (d) Another line of related studies (Liu et al. 2013; Sugiyama et al. 2013; Fazayeli and Banerjee 2016) prosed densityratio based strategies to estimate a differential graph between two graphs. Even though this group of methods can handle the unbalance dataset (i.e., the numbers of samples in two datasets are quite different), they can only capture the difference between two graphs (\(K=2\)). SIMULE does not have such a limitation on the number of tasks. (e) Moreover, several loosely related studies exist in settings different from ours. For example, for handling highdimensional time series data a few recent papers have considered exploring multiple sGGMs by modeling relationships among networks; e.g., Kolar et al. (2010), Qiu et al. (2013).
4.2 Biasing covariance matrices with SIMONEI and SIMULEI
The SIMONE package (Chiquet et al. 2011) has introduced “intertwined Lasso” (named as “SIMONEI” in the rest of paper) that takes the perspective of sharing the information among covariance matrices. More specifically, this variation averages each task’s sample covariance matrix with a global empirical covariance matrix obtained from the whole dataset. Motivated by SIMONEI, we extend SIMULE with a similar strategy that revises each task’s sample covariance matrix with \(\tilde{ \varSigma }^{(i)} = \alpha \varSigma ^{(i)} + (1\alpha ) n_{tot}^{1}\sum _{t=1}^K {n_t\varSigma ^{(t)}} \). Here \(\alpha = 0.5\) . This variation is referred as “SIMULEI” for the rest of this paper. We report experimental results from both SIMONEI and SIMULEI in Sect. 6.
Fan et al. (2014) have pointed out that sample covariance matrix is inconsistent under a highdimensional setting. There exists a huge body of previous literature for covariance matrices estimation. Roughly relevant studies can be grouped into three types: (a) Sparse covariance matrix estimation, including studies like hardthresholding (Lam and Fan 2009), softthresholding (Tibshirani 1996), smoothly clipped absolution deviation (SCAD, Fan et al. 2013), and minimax concavity penalties (Zhang 2010). These methods, though simple, do not guarantee the positive definiteness of the estimated covariance matrix. (b) Positive definite sparse covariance matrix estimation that introduces penalty functions on the eigenvector space. Popular studies include Antoniadis and Fan (2011), Liu et al. (2014), Levina et al. (2008), Rothman (2012). (c) Factormodel based covariance matrix estimation like POET (Fan et al. 2013) that applies softthresholding to the residual space obtained after applying an approximate factor structure on the estimated covariance matrix. Most of these studies can be used to extend SIMULE. We leave a more thorough study of such combinations as future work.
4.3 Penalized loglikelihood for SIMULE
Comparing Eq. (9) of SIMULE with Eq. (13), previous multisGGM approaches mostly relied on penalized loglikelihood functions for learning. We have also considered extending penalized loglikelihood method such as the “graphical lasso” estimator into our MTL setting: \(\varOmega ^{(i)} = \varOmega ^{(i)}_I + \varOmega _S\). However, it is hard to deal with the log determinant term \(logdet(\varOmega ^{(i)}_I + \varOmega _S)\) in optimization. This is because \(\frac{\partial logdet(\mathbf {X}+ \mathbf {Y})}{\partial \mathbf {X}} = (\mathbf {X}+ \mathbf {Y})^{1}\) and it is difficult to calculate the inverse of \(\mathbf {X}+\mathbf {Y}\). There is no closed form for the inverse of the sum of two matrices, except for certain special cases. From the perspective of optimization, it is hard to directly use methods like coordinate descent for learning such a model due to this first derivative issue. The authors of CSSLGGM (Hara and Washio 2013) handled the issue by using Alternating Direction Method of Multipliers (ADMM) . This optimization is much more complicated than SIMULE. Another package MNS (Monti et al. 2015) tackled this issue partly through adding latent variables. It assumes that each individual part \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\), in which \(\varPhi ^v\) is a latent variable learned by the EM algorithm.
4.4 Relevant studies using \(\ell _1\) optimization with constraints
The formulation of using \(\ell _1\) based objective with constraints has been explored by other tasks before. (a) For example, the Dantzig estimator (Candes and Tao 2007) uses the following \(\ell _1\) optimization with constraints for highdimensional linear regression:
This is in contrast to the lasso estimator using the penalized loglikelihood formulation for the highdimensional regression:
(b) For the case of singletask sGGM, the CLIME estimator (Cai et al. 2011) infers precision matrix through solving Eq. (1): \(\ell \)1 optimization with constraints (and further derived into a linear programming formulation). This has been shown to provide more favorable theoretical properties than penalized likelihood approaches like GLasso [solving Eq. (2)]. (c) Based on our knowledge, the proposed SIMULE model is the first study to use constrained \(\ell \)1 optimization for the case of multitasking sGGM. Table 2 summarizes relevant studies from the perspective of two different optimization formulations.
4.5 Optimization and computational concerns
Furthermore, a number of papers have focused on proposing ways to improve the performance of computation and data storage when estimating sGGMs. For example, the BigQUIC algorithm (Hsieh et al. 2011) aims at an asymptotic quadratic optimization when estimating sGGM. The authors of the Nodebased JGL method (Mohan et al. 2013) proposed a blockseparate method to improve the computational performance of multisGGMs. Our method can be extended to a parallel setting, since it can be naturally decomposed into per column based optimization. Nodebased JGL (Mohan et al. 2013) has also used the ADMM optimization algorithm for learning. Compared to linearprogrammingbased SIMULE, Mohan et al. (2013) has two main disadvantages: (a) a large number of complex optimizations are required to be solved, and (b) the condition of convergence is unspecified. SIMULE makes use of an efficient algorithm, is proven to achieve good numerical performance, and ensures convergence (see Sect. 5).
Furthermore, since imposing an \(\ell _1\) penalty on the model parameters formulates the structure learning of UGMs as a convex optimization problem, this strategy has shown successful results for modeling continuous data with GGM or NGM and discrete data with the pairwise Markov Random Fields (MRFs) (Friedman et al. 2008; Höfling and Tibshirani 2009). However, the discrete case of pairwise MRF is much harder because of the potentially intractable normalizing constant and also the possibility that each edge may have multiple parameters. One more complicating factor about structure learning is that the pairwise assumption might need a few exceptions, like searching for higherorder combinatorial interactions in recent studies like Schmidt and Murphy (2010), Buchman et al. (2012). (Detailed descriptions of related work are not included due to space limitation.)
5 Theoretical Analysis
5.1 Theoretical Analysis for Basic SIMULE
This section proves the theoretical properties of the SIMULE estimator. To present our analysis more concisely, we reformulate our model into a more general presentation as follows:
We first introduce the following important Lemma.
Lemma 1
For \(\varepsilon > 0\) and \(a,b \in \mathbb {R}\). If
where E is a constant. Then \(ab \ge 0\).
Proof
If \(ab<0\), let \(c = a + b\) and \(d = 0\). \(cd = 0 \ge 0\) and \(c+d = a+b\). Therefore, \(c + \varepsilon d = a+b<a < a + \varepsilon b\). This contradicts a, b are the optimal solution of Eq. (14). \(\square \)
Corollary 1
Assume \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7), then
\((\varOmega _S + \varOmega _I^{(i)})_1 = \varOmega _S_1 + \varOmega _I^{(i)}_1\)
Proof
By Lemma 1, we have that \(\widehat{\varOmega }_{I,j,k}^{(i)}\widehat{\varOmega }_{S,j,k} \ge 0\), if \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7). \(\square \)
Now we can rewrite our model in Eq. (7) as:
This because \(\varOmega _{tot}_1 + (\varepsilon  1)\varOmega _S^{tot}_1 = \sum \nolimits _{i=1}^K (\varOmega _S + \varOmega _I^{(i)})_1 + K(\varepsilon  1)\varOmega _S_1\). By Corollary (1), \(\varOmega _{tot}_1 + (\varepsilon  1)\varOmega _S^{tot}_1 = \sum \nolimits _i \varOmega ^{(i)}_I_1+ \varepsilon K\varOmega _S_1\).
We use \(\varSigma _{tot}^0\) to represent the true value of \(\varSigma _{tot}\) and \(\widehat{\varSigma }_{tot}\) as the estimated. We also use \(\varOmega ^0_{tot} = (\omega _1^0,\omega _2^0,\ldots ,\omega _{Kp}^0)\) to describe the true \(\varOmega _{tot}\) and \(\widehat{\varOmega }^1_{tot} = (\hat{\omega }^1_{ij})\) to denote the solution of the optimization problem in Eq. (15). The final solution is denoted as \(\widehat{\varOmega }_{tot} := (\hat{\omega }_{ij}) = (\hat{\omega }_1,\hat{\omega }_2,\ldots ,\hat{\omega }_{Kp})\) where \(\hat{\omega }_{ij} = \hat{\omega }_{ji} = \hat{\omega }^{1}_{ij}\mathop {\mathrm {sign}}( \max (\hat{\omega }^{1}_{ij} \hat{\omega }^{1}_{ji},0)) + \hat{\omega }^{1}_{ji} \mathop {\mathrm {sign}}( \max (\hat{\omega }^{1}_{ji} \hat{\omega }^{1}_{ij}, 0))\). Furthermore, we denote \({\mathbb E}[X^{(i)}]\) as \((\mu _1^{(i)},\mu _2^{(i)}, \ldots ,\mu _p^{(i)})^T\).
Notice that the constraint of Eq. (15) is not related to \(\varOmega _S^{tot}\). Therefore \(\widehat{\varOmega }^1_{tot}\) is also the solution of the following optimization problem.
In the following discussion, we split the analysis into two cases based on two different moment conditions for each \(X^{(i)}\).
(C1) Exponential tails Suppose there exists a constant \(0< \eta < 0.25\), so that \( \frac{\log p}{n_i} \le \eta \) and
where C is a constant.
(C2) Polynomialtype tails Suppose that for some \(\gamma \), \(c_1 > 0\), \(p \le c_1n_i^{\gamma }\), and for some \(\delta > 0\),
We begin by assuming that the precision matrix \(\varOmega \) belongs to the uniformity class of matrices,
Here q is a constant and \(0 \le q < 1\). \(\cdot _\mathbf{1}\) means the matrix 1norm. M is a constant representing the upper bound. \(\varOmega =: (\omega _{ij}) = (\omega _1,\ldots ,\omega _p)\). \(s_0(p)\) represents the sparsity level of \(\varOmega \) in the uniformity class. We would like to point out that \(s_0(p)\) is related to p, but there is no analytic form of relationship between \(s_0(p)\) and p.
Lemma 2
Let \(\{\hat{\varOmega }_{tot}^1 \}\) be the solution set of Eq. (7) and \(\{ \widehat{\mathbf {B}} \} := \{ (\widehat{\mathbf {B}}_I+ \widehat{\mathbf {B}}_S) \}\), where
Here \( \widehat{B}_I^{(i)} = (\widehat{\beta }_1^{(i)},\widehat{\beta }_2^{(i)},\ldots , \widehat{\beta }_p^{(i)})\) and \(\widehat{\mathbf {B}}_S = (\widehat{\beta }_1^s,\widehat{\beta }_2^s,\ldots ,\widehat{\beta }_p^s)\). For \(j \in \{1,2,\ldots , p\}\), \((\widehat{\beta }_j^{(1)},\widehat{\beta }_j^{(2)},\ldots ,\widehat{\beta }_j^{(K)},\widehat{\beta }_j^s)\) is an optimal solution of Eq. (9) when working on the jth column. Therefore, \(\{ \widehat{\varOmega }_{tot}^1 \} =\{ \widehat{\mathbf {B}} \}\).
Lemma 3
If each \(\varOmega _I^{(i)} + \varOmega _S\) satisfies Condition Eq. (19), then \(\varOmega _{tot}\) also satisfies Condition Eq. (19).
Proof
use the definition of \(\varOmega _{tot}\). \(\square \)
Corollary 2
\(\widehat{\varOmega }_{tot}\) satisfies the condition \(\widehat{\varOmega }_{tot} \succ 0\), with a high probability.
Proof
Lemma 3 indicates that \(\varOmega _{tot}\) satisfies Condition Eq. (19). \(\square \)
Theorem 3
Suppose that \(\varOmega ^0_{tot} \in \mathcal {U}(q,s_0(p))\). If \(\lambda _n \ge \varOmega ^0_{tot}_1(\max \nolimits _{ij}\hat{\sigma }_{ij}  \sigma ^0_{ij})\), we have that
and
where constant \(C_4 \le 2(1+2^{1q} + 3^{1q})(4\varOmega ^0_{tot}_1)^{1q}\) and constant \(C_5 \le 4 \varOmega ^0_{tot}_1 C_4\).
Theorem 4
Suppose that \(\varOmega ^0_{tot} \in \mathcal {U}(q,s_0(p))\) and \(n_{tot}=\sum _{i=1}^K n_i\).
(a) Assume that condition (C1) holds. Let \(\lambda _n = C_0M\sqrt{\frac{\log (Kp)}{n_{tot}}}\), where constant \(C_0 = 2\eta ^{2}(2 + \tau _0 + \eta ^{1}e^2C^2)^2\) and constant \(\tau _0 > 0\). Then
with a probability greater than \(1  4p^{\tau _0}\), where constant \(C_1 \le 2(1+2^{1q} + 3^{1q})4^{1q}C_0^{1q}\).
(b) Assume that condition (C2) holds. Let \(\lambda _{n} = C_2M\sqrt{\frac{\log (Kp)}{n_{tot}}}\), where constant \(C_2 = \sqrt{(5+\tau _0)(\theta _0 + 1)}\) and
Then
with a probability greater than \(1 O(n_{tot}^{\delta /8} + p^{\tau _0/2})\), where constant \(C_3 \le 2(1+2^{1q} + 3^{1q})4^{1q}C_2^{1q}\).
Through Eq. (23)–(29), we theoretically prove that we can achieve a good estimation of target dependency graphs with the convergence rate \(O(\log (Kp)/n_{tot})\). Based on CLIME (Cai et al. 2011), the convergence rate of singletask sGGM is \(O(\log p/n_i)\). Here \(n_{i}\) represents the number of samples of ith task. Assuming \(n_i = \frac{n_{tot}}{K}\), the convergence rate of single sGGM is \(O(K\log p/n_{tot})\). Clearly, since \(K\log p > \log (Kp)\), the convergence rate of SIMULE is better than singletask sGGM. This provides theoretical proofs for the benefit of multitasking sGGM. Neither of these theoretical results have been investigated by the previous studies.
All proofs of above theorems are provided in Sect. 7.
5.2 Theoretical analysis for NSIMULE estimator
In this subsection, we investigate the theoretical properties of NSIMULE estimator and prove that its convergence rate is the same as the SIMULE.
Theorem 5
Based on Liu et al. (2009), Suppose we use the estimated Kendall’s tau correlation matrix \(\widehat{\mathbf {S}}\) to replace \(\widehat{\varSigma }\) in the parametric GLasso estimator. Then under the same conditions on \(\widehat{\varSigma }\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal estimator achieves the same (parametric) rate of convergence as GLasso estimator for both the precision matrix estimation and the graph structure recovery.
Theorem 6
We use the estimated Kendall’s tau correlation matrix \(\hat{\mathbf {S}}^{(i)}\) in the Nonparanormal SIMULE estimator. Then under the same conditions on \(\hat{\varSigma }^{(i)}\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal SIMULE achieves the same rate of convergence as SIMULE (\(O(\sqrt{\log (Kp)/n_{tot}})\)) for both the graph recovery and precision matrix estimation.
Proof
We can directly apply Theorem 5 (the main theorem of Liu et al. 2009). Our multitask nonparanormal graph estimation will not change the convergence rate of SIMULE. \(\square \)
5.3 Potential nonidentifiability issue
Linear programming is not strongly convex. Therefore, there may be multiple ideal solutions in the SIMULE formulation of Eq. (7) (i.e., identifiability problem). In fact, the CLIME (Cai et al. 2011) estimator may also have multiple optimal solutions. Cai et al. (2011) have proved all such solutions converge to the true one at an optimal convergence rate. Similarly, in Sect. 5.1, we have proved that SIMULE formulation in Eq. (7) may result in multiple optimal solutions \(\{ \hat{\varOmega }_{tot} \}\). Each of these solutions \(\hat{\varOmega }_{tot}\) converges to the true solution with an optimal convergence rate. We present Theorem 7 showing that for each optimal solution \(\hat{\varOmega }_{tot}\), when \(\varepsilon \ne 1\), we can obtain unique estimation of \(\varOmega _S\) and \(\{ \varOmega _I^{(i)}  i = 1,\ldots ,K\}\).
Theorem 7
When we pick \(\varepsilon > 0\) and \(\varepsilon \ne 1\), for each optimal solution \(\hat{\varOmega }_{tot}\) from Eq. (7), there exist unique \(\varOmega _S\) and \(\{ \varOmega _I^{(i)}  i = 1,\ldots ,K\}\) satisfying Eq. (5).
We provide the proof of Theorem 7 in Sect. 7.
In practice, we need to decide \(\varepsilon \) according to the application for which SIMULE is used. For example, for the genomerelated biomedical data, we can normally assume that the shared subgraph is more dense than individual interactions of each context.^{Footnote 4} Therefore we pick \(\varepsilon < 1\) to reflect this assumption in our experiments in Sect. 6.
6 Experiments
In this section, we use four simulated datasets and two realworld biodatasets to evaluate the proposed estimators.
6.1 Experimental settings
6.1.1 Baselines
We compare SIMULE, SIMULEI and NSIMULE with the following baselines: (1) Three different multisGGM estimators including JGLfused, JGLgroup (Danaher et al. 2013), and SIMONE (Chiquet et al. 2011) (with the penalty functions described in Table 1); (2) The singletask CLIME baseline (i.e. each task uses CLIME independently); (3) The singletask nonparanormal CLIME (NCLIME) baseline (i.e., each task uses NCLIME independently). (4) The nonparanormal extension of JGLfused, JGLgroup, and SIMONE.^{Footnote 5} (5) SIMONEI (“intertwined Lasso” in SIMONE package) is added as a baseline for comparison with SIMULEI.^{Footnote 6}
6.1.2 Metric
The edgelevel false positive rate (FPR) and true positive rate (TPR) are used to measure the difference between the true graphs and the predicted graphs. By repeating each graph inference process 10 times, we obtain average metrics (also being averaged over K tasks) for each method we test. Here, FPR = \(\frac{\text {FP}}{\text {FP + TN}}\) and TPR = \(\frac{\text {TP}}{\text {TP + FN}}\). For a predicted graph, the TP (true positive) and TN (true negative) mean the number of true nonzero entries and true zero entries respectively. Each FPR vs. TPR curve shows the multipoint performance of a method over a range of its regularization parameter. In Tables 3, 4, 5 and 6 we compare estimators using the area under a FPR–TPR curve (AUC) and the partial area (partial AUC) under a FPR–TPR curve within the range of \(FPR < q\) (e.g., \(q=20\%\) or \(q=5\%\)). For instance, when we choose \(q=5\%\) we care about the partial AUC scores of those regions with very small FPRs. Higher AUC or partial AUC scores indicate that a method has achieved better results overall.
6.1.3 Selection of hyperparameter \(\lambda _{n}\)
Recent research studies from Negahban et al. (2009) and Yang et al. (2014) conclude that the regularization parameter \(\lambda \) of a singletask sGGM (e.g., with \(n_i\) samples) should satisfy \(\lambda \propto \sqrt{\frac{\log p}{n_i}}\). Combining this conclusion with our theoretical analysis in Sect. 5, we choose \(\lambda _{n} = \alpha \sqrt{\frac{\log (Kp)}{n_{tot}}}\) where \(\alpha \) for SIMULE or NSIMULE is a hyperparameter to tune. In our experiments, \(\alpha \) is varied over a range of \( \{0.05 \times i i\in \{1,2,3,\ldots ,30 \}\}\). The Bayesian information criterion (BIC) is used for situations requiring to select a specific value of hyperparameters.
Besides, we also need to tune the hyperparameters of the baseline methods to obtain their FPR–TPR curves. If only one hyperparameter needs tuning, we follow the same strategy as SIMULE. For those baselines (JGLfused and JGLgroup) having two hyperparameters, when given a certain \(\lambda _1\) (the same as \(\lambda _n\)), we use BIC criteria to select its best \(\lambda _2\) from a range of \(\{0.05\times i i\in \{1,2,3,\ldots ,20 \}\}\).
6.1.4 Selection of hyperparameter \(\varepsilon \)
\(\varepsilon \) reflects the difference of sparsity in the shared subgraph versus the contextspecific subgraphs. Section 5.3 has discussed our choice of \(\varepsilon \) on two realworld datasets. Similarly for the simulated experiments, we select \(\varepsilon \) from a range of \(\{0.1\times i i\in \{1,2,\ldots ,9\}\}\).
6.2 Simulated Gaussian datasets
Using the following two graph models, we first generate two synthetic multivariate Gaussian datasets, in which each model includes K tasks of data samples.

Model 1 Coming from Rothman et al. (2008), this model assumes \(\varOmega ^{(i)} = \mathbf {B}^{(i)}_I + \mathbf {B}_S + \delta ^{(i)}I\), where each offdiagonal entry in \(\mathbf {B}^{(i)}_I\) is generated independently and equal to 0.5 with probability 0.05i and 0 with probability \(1  0.05i\). The shared part \(\mathbf {B}_S\) is generated independently and equal to 0.5 with probability 0.1 and 0 with probability 0.9. \(\delta ^{(i)}\) is selected large enough to guarantee the positive definiteness of precision matrix. A clear shared structure \(\mathbf {B}_S\) exists among multiple graphs. We choose \(K \in \{2,3,4,5,6\}\) for this case.

Model 2 This model uses two specialstructure graphs, i.e., a grid graph and a ring graph. The first task uses a grid graph and the second uses a ring graph. These two special networks are popular in many realworld applications . For instance, certain biological pathways can be represented as rings. A clear shared structure exists between these two graphs. Clearly, \(K = 2\) for this case.
We choose \(p = 100\), i.e., the dimension of feature variables is 100. For each dataset, 500 data samples are generated randomly. Using either Model 1 or Model 2, we generate K blocks of data samples with the ith block following \(N(0, (\varOmega ^{(i)})^{1})\). In detail, for ith task, we generate 500 simulated data samples following multivariate Gaussian Distribution with mean 0 and covariance matrix \((\varOmega ^{(i)})^{1}\). We use the multivariate distribution method from stochastic simulation (Ripley 2009) to sample the simulated data blocks. In our implementation, we directly use the R function “mvrnorm” in MASS package.
In summary, we first simulate precision matrices by Model 1 or Model 2. We then use multivariate distribution method (Ripley 2009) to sample multivariate Gaussian Distributed data blocks with mean 0 and covariance matrix \((\varOmega ^{(i)})^{1}\). This stochastic procedure will generate simulated data blocks with the decomposition in Eq. (5). Then we apply SIMULE, NSIMULE and baseline models on these datasets to obtain the estimated dependency networks.
6.3 Simulated nonparanormal datasets
Using the same graphs of Model 1 and Model 2, we simulate two more sets of data samples following the nonparanormal distributions in K different tasks. We pick \(K=3\) for Model 1 and \(K =2\) for Model 2. Starting from \(N(0, (\varPhi ^{(i)})^{1})\) and transforming with a monotone function \(x : \rightarrow sign(\mathbf {x})\mathbf {x}^{\frac{1}{2}}\), data sample is generated as a random vector Z, where \(sign(Z)Z^2 = (sign(\mathbf {z}_1)\mathbf {z}_1^2,\ldots ,sign(\mathbf {z}_p)\mathbf {z}_p^2) \sim N(0, (\varPhi ^{(i)})^{1})\). Thus Z follows a nonparanormal distribution. Using different monotone functions for generating simulated data will not change the resulting correlation matrix \(\mathbf {S}\) since we use the rankbased nonparametric estimator estimating the correlation matrix.
6.4 Experimental results on synthetic datasets
Using Tables 3, 4, 5 and 6, we compare all methods using (a) the area under a FPR–TPR curve (AUC), (b) the area under a FPR–TPR curve within the range of \(FPR < 0.2\) (AUC\(20\%\)), (c) the area under a FPR–TPR curve within the range of \(FPR < 0.05\) (AUC\(5\%\)), (d) the AUC scores when evaluating only on the shared part (AUCshared), (e) the AUC scores when evaluating only on the individual parts of multiple graphs (AUCindividual), and (f) AUC scores when changing p (AUC\(p=200\)). Tables 3 and 4 include results on datasets from graph Model 1. Tables 5 and 6 are about Model 2. Tables 3 and 5 show comparisons on the simulated Gaussian datasets. Tables 4 and 6 are about results on the simulated nonparanormal datasets. Besides, Tables 4 and 6 also provide the results of the nonparanormal implementations of three baselines^{Footnote 7}—JGLfused, JGLgroup and SIMONE. In summary, we can conclude from these tables that (1) SIMULE and NSIMULE methods obtain better AUC and partial AUC than three multisGGM competitors, nonparanormal extensions of these three competitors, and two singlesUGM baselines across Model 1 and Model 2, and across two different values of p. (2) NSIMULE produces better scores than SIMULE on nonparanormal data across both graph models (more apparent on Model 1). Similarly, nonparanormal extensions of three multisGGM baselines outperform their Gaussianversions on the nonparanormal datasets. (3) All multitask estimators perform better than singleCLIME and singleNCLIME estimators. (4) On datasets from Model 2, the score differences among multitask estimators are not as apparent as the case of Model 1. (5) SIMULE and NSIMULE achieve better estimation of the individual subgraphs than other baselines. (6) The columns of “AUCshared” show that JGLfused and SIMONE estimate the shared parts better. This is as expected, since their penalty functions have enforced the shared similarity among graphs. (7) Interestingly, in Tables 3 and 5 NSIMULE achieves similar or even better AUC scores on Gaussian datasets. This is actually also as expected, since under a highdimensional setting the Kendalls tau correlation matrix used in NSIMULE has been proved to provide consistent estimation of correlation matrix with an optimal convergence rate (Liu et al. 2009). Differently, sample covariance matrix used by SIMULE is inconsistent for highdim cases (Fan et al. 2014). As a future step, large covarianceestimators like POET (Fan et al. 2014) will be used to replace samplecovariance in SIMULE. (8) Related to the previous point, SIMULEI and SIMONEI have revised contextspecific sample covariance matrices with a bias towards a global empirical covariance matrix over all K tasks. The last two rows in Tables 3 and 5 clearly show that such a simple strategy on covariance estimation has provided significant performance improvement over both SIMULE and SIMONE.
Figure 3 shows more detailed FPR–TPR comparisons among the joint sGGM estimators. The subfigure (a) “GaussianModel1” clearly shows that our methods obtain better curves than three multisGGM baselines. On the subfigure (b) “GaussianModel2”, the differences among multisGGM estimators are not as apparent as “GaussianModel1”. Figure 3c, d (in the second row) show FPR–TPR curves from two nonparanormal datasets. Overall, the observations from comparing these curves are consistent with those obtained from the four tables (Tables 3, 4, 5, 6).
Since the difference among multitask estimators is not apparent on the datasets from Model 2, we further visualize the derived graph structures from all joint sGGM estimators using a blackwhite colormap visualization (i.e., black for nonzero entries and white for zero entries of precision matrix) in Fig. 4. The left two columns of Fig. 4 present all color maps we have obtained on “GaussianModel 2”. The right two columns show all color maps from “nonparanormalModel 2”. The first row represents the original grid graph and ring graph from Model 2. Since these two graph structures are clearly related, we can observe a diagonal band shared by the two colormaps. “Graph 1” grid graph has two more offdiagonal narrow bands while “Graph 2” ring graph does not. The color maps of predicted graphs with proper tuning parameters from the NSIMULE and SIMULE method s are presented as the second row and the third row. Three other multisGGM baselines are shown on the fourth, fifth and the sixth rows. Instead of pushing two graphs to be too similar, SIMULE and NSIMULE methods achieve a better recovery of both the ring graph and the grid graph. In contrast, three multitask baseline methods all obtain an incorrect graph on the second task (ring graph) by pushing it to be too similar to the first task (grid graph). The single task baselines (CLIME as the 8th row and NCLIME as the 7th row) obtain two much worse graphs because they do not model the relationship among the two graphs at all.
One important hyperparameter we need to pick for SIMULE and NSIMULE is \(\varepsilon \) in Eq. (7). This hyperparameter reflects the sparsity level of the shared subgraph against the contextspecific parts. The left two subfigures (a) and (c) of Fig. 5 show the changes of sparsity level in both individual and shared subgraphs across multiple values of \(\varepsilon \) (by running SIMULE on the GaussianModel 1 case). Figure 5a is for \(K=3\) and Fig. 5c is for \(K= 6\). We can see that when \(\varepsilon \) increases, the sparsity level of shared portion decreases while the averaging sparsity of individual parts increases across both cases of K. This matches our analysis in Sect. 5.3. In real applications, \(\varepsilon \) indicates the differences of sparsity constraints we assume on shared and individual parts. It should be chosen according to the domain knowledge of a specific application.
In addition, Fig. 5e tries to investigate whether the changes of \(\varepsilon \) influence the performance (AUC) of (N)SIMULE. On datasets from Model 1, Fig. 5d shows that AUC scores of both SIMULE and NSIMULE exhibit very small variations across a large range of changing \(\varepsilon \).
Section 2.5 presents a parallel variation of SIMULE. On the simulated data of GaussianModel 1, we compare the training speed of originalSIMULE versus parallelSIMULE using Fig. 5b (the right upper subfigure). For the parallelSIMULE, we run SIMULE by paralleling “column per core” using 63 cores on a 64core machine. The baselines, including the originalSIMULE, JGLfused, JGLgroup, SIMONE and SIMONEI, are run on one restricted core on the same machine.^{Footnote 8} Figure 5b provides the computational speed (training time in \(\log \)seconds) across values of dimension p. It clearly shows that the parallelSIMULE runs much faster than the singlecore SIMULE implementation and other baselines.^{Footnote 9}
Furthermore, Fig. 5d, f provide AUC scores of SIMULE, three multiSGGM baselines, SIMONEI and SIMULEI against the varying p and against the with varying K. SIMULEI provides a consistent performance improvement over SIMULE. SIMULE outperforms three multisGGM baselines across multiple values of p and K. SIMULEI outperforms SIMONEI and runs smoothly for larger values of K. Unfortunately Fig. 5f can not provide the AUC scores of SIMONEI for \(K=5\) and \(K=6\), because the SIMONE package could not converge for some \(\lambda \) values under these two cases of K.
6.5 Experiment results on real application I: identifying gene interaction using gene expression data across two cell contexts
Next, we apply SIMULE and the baselines on one realworld biomedical data: gene expression profiles describing many human samples across multiple cancer types (aggregated by McCall et al. 2011). Recently advancements in genomewide monitoring have resulted in enormous amounts of data across most of the common cell contexts, like multiple common cancer types (The Cancer Genome Atlas Research Network 2011). Complex diseases such as cancer are the result of multiple genetic and epigenetic factors. Thus, recent research has shifted towards the identification of multiple genes/proteins that interact directly or indirectly in contributing to certain disease(s). Structure learning of UGMs on such heterogeneous datasets can uncover statistical dependencies among genes and understand how such dependencies vary from normal to abnormal or across different diseases. These structural variations are highly likely to be contributing markers that influence or cause the diseases.
Two major cell contexts are selected from the human expression dataset provided by McCall et al. (2011): leukemia cells (including 895 sample s) and normal blood cells (including 227 samples). Then we choose the top 1000 features from the total 12,704 features (ranked by variance) and apply SIMULE^{Footnote 10}), JGLfused, JGLgroup, SIMONE, singleCLIME, SIMONEI and SIMULEI on this twotask dataset. The derived dependency graphs are compared by using the number of predicted edges being validated by three major existing protein/gene interaction databases (Prasad et al. 2009; Orchard et al. 2013; Stark et al. 2006).^{Footnote 11} The numbers of matches between interactions in databases and those edges predicted by each method have been shown as a bar graph in the left subfigure of Fig. 6. This bar graph clearly shows that SIMULE consistently outperforms three multisGGM baselines and CLIME on both individual and shared interactions from both cell contexts. Interestingly, SIMULEI outperforms SIMONEI and SIMULE. SIMONEI achieves a better recovery of known edges than JGLgroup, JGLfused, SIMONE and CLIME. This leaves us to believe that exploring more choices of covariance matrix estimation to extend SIMULE is promising.
6.6 Experiment results on real application II : identifying collaborations among TFs across multiple cell types
In molecular biology, the regulatory proteins that interact with one another to control gene transcription are known as transcription factors (TFs). TF proteins typically perform major cell regulatory functions (e.g., binding on DNA) by working together with other TFs. The collaboration patterns (e.g., conditional independence) among TFs normally vary across different cell contexts (e.g., cell lines). Meanwhile, a certain portion of the TF interactions are preserved across contexts. Understanding the collaboration networks among TFs is the key to understanding cell development, including defects, which lead to different diseases. The ChIPSeq datasets recently made available by the ENCODE project (ENCODE Project Consortium 2011) provide simultaneous binding measurements of TFs to thousands of gene targets. These measurements provide a “snapshot” of TF binding events across many cell contexts. The task of uncovering the functional dependencies among TFs connects to the task of discovering the statistical dependencies among TFs from their ChIPSeq measurements.
Recently, two relevant papers (Cheng et al. 2011; Min et al. 2014) have discussed methods to infer coassociation networks among TFs using ChIPSeq data. Their approaches differ from ours as both projects has targeted a single cell type at a time. We select ChIPSeq data for 27 TFs that are covered by ENCODE (ENCODE Project Consortium 2012) across three major human cell lines including (1) H1hESC (embryonic stem cells : primary tissue), (2) GM12878 (Blymphocyte:normal tissue) and (3) K562 (leukemia:cancerous tissue).
We apply SIMULE, SIMULEI, JGLgroup, JGLfused, SIMONE and SIMONEI to this multicell TF ChIPSeq dataset. Comparisons of different methods are performed using three major existing protein interaction databases (Prasad et al. 2009; Orchard et al. 2013; Stark et al. 2006). The numbers of matches between TFTF interactions in databases and those predicted by each method have been plotted as a bar graph shown in Fig. 6b (the right subfigure). The graph shows that SIMULE consistently outperforms JGLfused, JGLgroup and SIMONE on both individual and shared interactions from all three cell types. SIMULEI performs better than SIMONEI and SIMULE. We further evaluated the resulting TF interactions using the popular “functional enrichment” analysis with DAVID (Da Wei Huang and Lempicki 2008) and found that SIMULE and SIMULEI can reveal known functional sets and potentially novel interactions that drive leukemia. This leads us to believe that our approach can be used in a wider range of applications as well.
Many domainspecific studies have applied sGGM on realworld datasets, especially those from molecular biology or brain science. For example, Ma et al. (2007) estimates gene networks using miroarray data in the model species Arabidopsis. Krumsiek et al. (2011) explores GGM to reconstruct pathway reactions from highthroughput metabolomics data. Within brain science, a few studies (Ng et al. 2013; Huang et al. 2010; Monti et al. 2015; Sun et al. 2009) have tried to learn brain connectivity of Alzheimer’s disease through sparse inverse covariance estimation. Due to space limit, we omit detailed inclusions of these studies.
7 Proof of Theorems
In this section, I(S) denotes the indicator function of the set S.
7.1 Lemma 2
Proof
In the following, we just show the case of \(1 \le j \le p\). When \(j > p\), by replacing \(e_{j \mod p}\) in the following proofs, it will still hold. \(\varSigma _{tot}\hat{\varOmega }_{tot}^1  I_{\infty } \le \lambda _n \) is equivalent to \(\varSigma _{tot}\hat{\omega }_j^1  e_j_{\infty } \le \lambda _n\) for \(1 \le j \le p\).
Thus we have
Also \(\varSigma _{tot}\hat{B}  I_{\infty } \le \lambda _n\). So by the definition of \(\{\widehat{\varOmega }_{tot}^1 \}\), we have that
With Inequality (30) and Inequality (31), we have that \(\widehat{\mathbf {B}} \in \{\widehat{\varOmega }_{tot}^1\}\). On the other hand, if \(\widehat{\varOmega }_{tot}^1 \notin \{ \widehat{\mathbf {B}} \}\), then \(\exists i \text { such that } \hat{\omega }_j^1_1 > \hat{\beta }_j + \hat{\beta }^s_j_1\). Thus by Inequality (30), we have that \(\widehat{\varOmega }^1_1 > \widehat{\mathbf {B}}_1\). This is in conflict with Inequality (31). \(\square \)
7.2 Theorem 3
Proof
By the condition in Theorem 3,
we then have
where \(\mathbf {A}\mathbf {D}_{\infty } \le \mathbf {A}_{\infty }\mathbf {D}_1\) for any matrices \(\mathbf {A}\),\(\mathbf {D}\) of appropriate sizes.
Since \(\varOmega _{tot}^0\) also satisfies Eq. (16) and \(\widehat{\varOmega }_{tot}^1\) is the optimal solution which satisfies Eq. (16)
we have that
Therefore, based on Inequality (32) to Inequality (35)
Then,
This proves Inequality (20) of Theorem 3.
To prove the Inequality (21), let \(t_n = \widehat{\varOmega }_{tot}  \varOmega _{tot}^0_{\infty }\) and define
By the definition of \(\widehat{\varOmega }_{tot}\), we have that \(\hat{\omega }_j_1\le \hat{\omega }^1_j_1 \le \omega ^0_j_1\). Then
which implies that \(\mathbf {h}^2_j_1 \le \mathbf {h}^1_j_1\). It follows that \(\mathbf {h}_j_1 \le 2\mathbf {h}_j^1_1\). Thus we only need to get the upper bound of \(\mathbf {h}_j^1_1\). We have that
This proves the Inequality (21) of Theorem 3. For any \(a,b,c\in {\mathbb R}\), we have that \(I\{ a<c\}  I\{ ab<c\}\le I\{ bc<ab\}\)
Finally, Inequality (22) can be derived from Inequality (20), Inequality (36) and the inequality relationship \(\mathbf {A}^2_{F} \le p\mathbf {A}_1\mathbf {A}_{\infty }\). \(\square \)
7.3 Theorem 4
Proof
Theorem 4(a)
In the beginning of Sect. 5, we define that \(\varSigma _{tot} :=(\sigma _{ij})_{Kp\times Kp} \). By Theorem 3, we only need to prove
with probability greater than \(14p^{\tau _0}\) under (C1). Without loss of generality, we assume that \({\mathbb E}X^{(i)} = 0\). Let \(\hat{\varSigma }_{tot}^0 := n_{tot}^{1}\sum \limits ^n_{k=1}{\mathbf {X}_{tot}}_k{\mathbf {X}_{tot}}_k^T\) and \(\mathbf {Y}_{kij} = {\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj}  {\mathbb E}{\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj}\). We then have \(\hat{\varSigma }_{tot} = \hat{\varSigma }_{tot}^0 \bar{\mathbf {X}}_{tot}\bar{\mathbf {X}}_{tot}^T\). Let \(t = \eta \sqrt{\log Kp/n_{tot}}\).
Now we use the inequality \(e^s  1  s< s^2e^{\max (s,0)}\) for any \(s \in {\mathbb R}\) and let
with a few step of basic calculations we can get
Thus, we have
Then using the simple inequality \(e^s \le {e}^{s^2+1}\) for \(s>0\), we have
Let
and \(a_n = C_{K2}^2(\log Kp /n_{tot})^{\frac{1}{2}}\). As before, we can show that
By Inequality (39), Inequality (42), and the inequality formulation \(C_0 > \eta ^{1} C_{K1} + \eta ^{2}a_n\), we can prove that Inequality (37) holds. \(\square \)
Proof
Theorem 4(b)
Let \(\bar{\mathbf {Y}}_{kij} = \mathbf {X}_{ki}\mathbf {X}_{kj}I\{\mathbf {X}_{ki}\mathbf {X}_{kj}\le \sqrt{n_{tot}/(\log Kp)^3}\}  {\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{\mathbf {X}_{ki}\mathbf {X}_{kj}\le \sqrt{n_{tot}/(\log Kp)^3}\}\), \(\check{\mathbf {Y}}_{kij} = \mathbf {Y}_{kij}  \bar{\mathbf {Y}}_{kij}\).
Since
\(b_n := \max _{i,j}{\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{\mathbf {X}_{ki}\mathbf {X}_{kj}\le \sqrt{n_{tot}/(\log Kp)^3}\} = O(1)n_{tot}^{\gamma  \frac{1}{2}}\), we have, by condition (C2),
By Bernstein’s inequality (cf. Bennett 1962) and some elementary calculations,
Therefore, we have
By using the same truncation argument and Bernstein’s inequality, we can show that \({\mathbb P}(\max \nolimits _i\sum \nolimits _{k=1}^{n_{tot}}\mathbf {X}_{ki}\ge \sqrt{\max \nolimits _i \sigma ^0_{ii}(4+\tau _0)n_{tot}\log Kp}) = O(n_{tot}^{\delta /8} + p^{\tau _0/2})\).
Therefore,
Combining Eq. (43) and Eq. (44), then, we have
with a probability greater than \(1O(n_{tot}^{\delta /8} + p^{\tau _0/2})\). The proof is completed by Inequality (45) and Theorem 3. \(\square \)
7.4 Theorem 7
Proof
Assume that there exists two different optimal solution \((\hat{\varOmega }^{(i)}_I, \hat{\varOmega }_S)\) and \((\hat{\varOmega }^{(i)\prime }_I, \hat{\varOmega }_S')\), which satisfies Eq. (7) and \(\hat{\varOmega }^{(i)}_I + \hat{\varOmega }_S = \hat{\varOmega }^{(i)\prime }_I + \hat{\varOmega }_S' \). Since they are the optimal solutions, by Lemma 1, we have that \(\hat{\varOmega }^{(i)}_{I,j,k}\hat{\varOmega }_{S,j,k} \ge 0\) and \(\hat{\varOmega }^{(i)\prime }_{I,j,k}\hat{\varOmega }_{S,j,k}' \ge 0\). Therefore, \(\hat{\varOmega }^{(i)}_{I,j,k} + \hat{\varOmega }_{S,j,k} = \hat{\varOmega }^{(i)}_{I,j,k} + \hat{\varOmega }_{S,j,k}\) and \(\hat{\varOmega }^{(i)\prime }_{I,j,k} + \hat{\varOmega }_{S,j,k}' = \hat{\varOmega }^{(i)\prime }_{I,j,k} + \hat{\varOmega }_{S,j,k}'\). Now we get
Then we denote \(\hat{\varOmega }^{(i)} = \hat{\varOmega }^{(i)}_I + \hat{\varOmega }_S\) and \(\hat{\varOmega }^{(i)\prime } = \hat{\varOmega }^{(i)\prime }_I + \hat{\varOmega }_S'\). Combining Eq. (7), Eq. (5) and Eq. (46), \((\hat{\varOmega }^{(i)}, \hat{\varOmega }_S)\) and \((\hat{\varOmega }^{(i)\prime }, \hat{\varOmega }_S')\) are both the optimal solution of
Because the constrains are not related to \(\varOmega _S\) and because we have that \(\hat{\varOmega }^{(i)}_{I,j,k} + \hat{\varOmega }_{S,j,k} = \hat{\varOmega }^{(i)}_{j,k} = \hat{\varOmega }^{(i)\prime }_{j,k} = \hat{\varOmega }^{(i)\prime }_{I,j,k} + \hat{\varOmega }_{S,j,k}'\). By Lemma 1, \(\hat{\varOmega }^{(i)}_{I,j,k} + \hat{\varOmega }_{S,j,k} = \hat{\varOmega }^{(i)\prime }_{I,j,k} + \hat{\varOmega }_{S,j,k}' = \hat{\varOmega }^{(i)}_{j,k}\). When \(\varepsilon < 1\), the optimization of Eq. (47) changes to the following:
Equation (48) gives the unique solution \(\hat{\varOmega }_{S,j,k} = \min \nolimits _i(\hat{\varOmega }_{j,k}^{(i)}) \mathop {\mathrm {sign}}({\mathop {\hbox {argmin}}\nolimits _{\hat{\varOmega }_{j,k}^{(i)}}} \hat{\varOmega }_{j,k}^{(i)})\). The uniqueness contradicts the assumption of \(\hat{\varOmega }_S \ne \hat{\varOmega }_S'\). For \(\varepsilon > 1\), similarly we can prove, there exist unique \(\varOmega _S\) and \(\{ \varOmega _I^{(i)}  i = 1,\ldots ,K\}\). \(\square \)
8 Conclusions
This paper introduces a novel method SIMULE for learning shared and distinct patterns simultaneously when inferring multiple sGGMs or sNGMs jointly. Through \(\ell \)1constrained formulation our solution is efficient and can be parallelized. We successfully apply SIMULE on four synthetic datasets and two realworld datasets. We prove the convergence property of SIMULE to be favorable and justify the benefit of multitasking. Future work will extend SIMULE to model more complex relationships among contexts.
Notes
Following convention, \(\varSigma \), \(\varOmega \), \(\varPhi \), \(\beta \), \(\mu \), \(\theta \) and I are not bold.
We can not find CSSLGGM implementation, therefore can not include it as a baseline.
This assumes more interactions are preserved across cell contexts, i.e., partly due to concerns of system or evolutionary stability.
We extend JGL and SIMONE on nonparanormal distributions by replacing the codes for sample covariance matrix into Kendall’s tau correlation matrix in their R implementations.
It is possible to combine nonparanormal and intertwined strategies to extend SIMULE. We leave this as a future work.
We revise the R packages of three baselines to extend them to nonparanormal distribution. This is the provide a more fair comparison of these baselines versus NSIMULE.
The multicore setting we use for this time experiment is to reflect the distributed parallel nature of SIMULE. We leave the topic of using the multithreading to improve SIMULE as future research.
When \(p \ge 400\), SIMONEI takes more than 5 days to train. That’s why the no data points are shown in Fig. 5b for such cases.
NSIMULE was tried as well and has achieved the same validation result as SIMULE.
We would like to point out that the interactions SIMULE finds represent statistical dependencies between genes that vary across multiple cell types. There exist many possibilities for such interactions, including like physical proteinprotein interactions, regulatory gene pairs or signaling relationships. Therefore, we combine multiple existing databases for a joint validation.
Abbreviations
 \(\varSigma \) :

Covariance matrix
 \(\mu \) :

Mean vector in the Gaussian distribution
 \(\varOmega \) :

Precision matrix
 \(\mathbf {X}\) :

Data sample matrix
 \(X_j\) :

A random variable follows the Gaussian Distribution
 \(\varSigma ^{(i)}\) :

ith Covariance matrix
 \(\varOmega ^{(i)}\) :

ith Precision matrix in a multitask setting
 \(\varOmega _S\) :

Shared pattern among all precision matrices in a multitask setting
 \(\varOmega ^{(i)}_I\) :

Individual part of ith Precision matrix in a multitask setting
 \(\mathbf {X}^{(i)}\) :

ith data sample matrix in a multitask setting
 \(n_i\) :

Number of samples in ith data matrix
 \(n_{tot}\) :

Total number of samples in a multitask setting
 \(\beta \) :

A column of \(\varOmega \)
 \(\beta ^{(i)}\) :

A column of \(\varOmega ^{(i)}\)
 \(\mathbf {x}\) :

A pdimensional sample
 \(\mathbf {e}_j \) :

\( (0,\ldots ,1,\ldots ,0)^T\)
 \(\theta \) :

A parameter used in linear programming formulation [Eq. (10)], \([\beta ^{(1)^T}, \ldots , \beta ^{(i)^T},\ldots ,\beta ^{(K)^T},\varepsilon K{(\beta ^s)}^T]^T\)
 \(\mathbf {A}^{(i)}\) :

A parameter used in linear programming formulation [Eq. (10)], \([0,\ldots ,0,\varSigma ^{(i)},0,\ldots ,0,\frac{1}{\varepsilon K}\varSigma ^{(i)}]\)
 \(\varvec{b}\) :

A parameter used in linear programming formulation [Eq. (10)]. Equals to \(\mathbf {e}_j\)
 Z :

A random variable follows the nonparanormal distribution
 \(\mathbf {S}\) :

Correlation matrix of Z
 \(\varSigma _{tot}\) :

Defined in Sect. 5.1
 \(\varOmega _I\) :

Defined in Sect. 5.1
 \(\varOmega _S^{tot}\) :

Defined in Sect. 5.1
 \(\varOmega _{tot}\) :

Defined in Sect. 5.1
 \(X_{tot}\) :

Defined in Sect. 5.1
 \(I_K\) :

Defined in Sect. 5.1
 \(\sigma _{ij}\) :

An entry of \(\varSigma _{tot}\)
 \(\omega _{ij}\) :

An entry of \(\varOmega _{tot}\)
 \(\omega _{j}\) :

A column of \(\varOmega _{tot}\)
 \(\hat{\varOmega }_{tot}\) :

Estimated \(\varOmega _{tot}\)
 \(\varOmega ^0_{tot}\) :

True \(\varOmega _{tot}\)
 \(\omega ^0_j\) :

A column of \(\varOmega ^0_{tot}\)
 \(\hat{\varSigma }_{tot}\) :

Estimated \(\varSigma _{tot}\)
 \(\varSigma ^0_{tot}\) :

True \(\varSigma _{tot}\)
 \(\hat{\omega }_{j}\) :

A column of \(\hat{\varOmega }_{tot}\)
 \(\hat{\varOmega }^1_{tot}\) :

Solution of Eq. (15)
 \(\lambda _n\) :

Hyperparameter for the sparsity in Eq. (7)
 \(\varepsilon \) :

Hyperparameter balancing shared and individual part in Eq. (6)
 K :

Total number of tasks
 p :

Total number of features
 C :
 q :

A constant between 0 and 1 [Eq. (19)]
 \(\eta \) :

A constant between 0 and 0.25 in condition (C1) [Eq. (17)]
 \(\gamma ,\delta \) :

Two constants in condition (C2) [Eq. (18)]
 M :

A constant represents the upper bound in Eq. (19)
 \(s_0(p)\) :

A constant represents the sparsity level of \(\varOmega \) in Eq. (19)
 \(\hat{\mathbf {B}}, \hat{\mathbf {B}}_I, \hat{\mathbf {B}}_S\) :

Solution of Eq. (9)
 \(\hat{\varvec{b}}^{(i)}_j\) :

A column of \(\hat{\mathbf {B}}_I\)
 \(\hat{\varvec{b}}_j^S\) :

A column of \(\hat{\mathbf {B}}_I\)
 \(\tau _0\) :

A constant in Theorem 4
 \(C_4,C_5\) :

Constants in Theorem 3
 \(C_0,C_1\) :

Constants in Theorem 4(a)
 \(\theta _0 \) :

Equals to \(\max \nolimits _{i,j,k} \hat{\varSigma }_{j,k}^{(i)}\)
 \(C_2,C_3\) :

Constants in Theorem 4(b)
 \(\hat{\sigma }_{ij}\) :

An entry of \(\hat{\varSigma }_{tot}\)
 \(\sigma ^0_{ij}\) :

An entry of \(\varSigma _{tot}^0\)
 \(C_{K1},C_{K2}\) :

Two constants used in the proof of Theorem 4(b)
 \(\mathbf {h}_j \) :

\( \hat{\omega }_j  \omega ^0_j\)
 \(\mathbf {h}_j^1 \) :

\( (\hat{\omega }_{ij}I\{ \hat{\omega }_{ij}\ge 2t_n \}; 1\le i \le p)^T  \omega _j^0\)
 \(\mathbf {h}_j^2 \) :

\( \mathbf {h}_j  \mathbf {h}_j^1\)
 \(\bar{\mathbf {Y}}_{kij} \) :

\( \mathbf {X}_{ki}\mathbf {X}_{kj}I\{\mathbf {X}_{ki}\mathbf {X}_{kj}\le \sqrt{n_{tot}/(\log Kp)^3}\}  {\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{\mathbf {X}_{ki}\mathbf {X}_{kj}\le \sqrt{n_{tot}/(\log Kp)^3}\}\)
 \(\check{\mathbf {Y}}_{kij} \) :

\( \mathbf {Y}_{kij}  \bar{\mathbf {Y}}_{kij}\)
 \(b_n \) :

\( \max _{i,j}{\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{\mathbf {X}_{ki}\mathbf {X}_{kj}\le \sqrt{n_{tot}/(\log Kp)^3}\}\)
 \(\mathbf {Y}_{kij} \) :

\( {\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj}  {\mathbb E}{\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj}\)
 \(\varPhi ^{(i)} \) :

The inverse of ith correlation matrix for the nonparanormal case
References
Antoniadis, A., & Fan, J. (2011). Regularization of wavelet approximations. Journal of the American Statistical Association, 96(455), 939–967.
Banerjee, O., El Ghaoui, L., & d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485–516.
Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Buchman, D., Schmidt, M., Mohamed, S., Poole, D., & de Freitas, N. (2012). On sparse, spectral and other parameterizations of binary probabilistic models. In AISTATS (pp. 173–181)
Cai, T., Liu, W., & Luo, X. (2011). A constrained 1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494), 594–607.
Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2313–2351.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.
Cheng, C., Yan, K. K., Hwang, W., Qian, J., Bhardwaj, N., Rozowsky, J., et al. (2011). Construction and analysis of an integrated regulatory network derived from highthroughput sequencing data. PLoS Computational Biology, 7(11), e1002190.
Chiquet, J., Grandvalet, Y., & Ambroise, C. (2011). Inferring multiple graphical structures. Statistics and Computing, 21(4), 537–553.
Da Wei Huang, B. T. S., & Lempicki, R. A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1), 44–57.
Danaher, P., Wang, P., & Witten, D. M. (2013). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2), 373–397.
Di Martino, A., Yan, C. G., Li, Q., Denio, E., Castellanos, F. X., Alaerts, K., et al. (2014). The autism brain imaging data exchange: Towards a largescale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry, 19(6), 659–667.
ENCODE Project Consortium. (2011). A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology, 9(4), e1001046.
ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57–74.
Evgeniou, T., & Pontil, M. (2004). Regularized multitask learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117). ACM.
Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review,. doi:10.1093/nsr/nwt032.
Fan, J., Liao, Y., & Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4), 603–680.
Fazayeli, F., & Banerjee, A. (2016). Generalized direct change estimation in ising model structure. arXiv preprint arXiv:1606.05302.
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.
Guo, J., Levina, E., Michailidis, G., & Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika,. doi:10.1093/biomet/asq060.
Han, F., Liu, H., & Caffo, B. (2013). Sparse median graphs estimation in a high dimensional semiparametric model. arXiv preprint arXiv:1310.3223.
Hara, S., & Washio, T. (2013). Learning a common substructure of multiple graphical Gaussian models. Neural Networks, 38, 23–38.
Hastie, T., Tibshirani, R., Friedman, J., Hastie, T., Friedman, J., & Tibshirani, R. (2009). The elements of statistical learning. Berlin: Springer.
Höfling, H., & Tibshirani, R. (2009). Estimation of sparse binary pairwise markov networks using pseudolikelihoods. The Journal of Machine Learning Research, 10, 883–906.
Honorio, J., & Samaras, D. (2010). Multitask learning of Gaussian graphical models. In Proceedings of the 27th international conference on machine learning (ICML10) (pp. 447–454).
Hsieh, C. J., Sustik, M. A., Dhillon, I. S., & Ravikumar, P. D. (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS (pp. 2330–2338).
Huang, S., Li, J., Sun, L., Ye, J., Fleisher, A., Wu, T., et al. (2010). Learning brain connectivity of alzheimer’s disease by sparse inverse covariance estimation. NeuroImage, 50(3), 935–949.
Ideker, T., & Krogan, N. J. (2012). Differential network biology. Molecular Systems Biology, 8(1), 565.
Kelly, C., Biswal, B. B., Craddock, R. C., Castellanos, F. X., & Milham, M. P. (2012). Characterizing variation in the functional connectome: Promise and pitfalls. Trends in Cognitive Sciences, 16(3), 181–188.
Kolar, M., Song, L., Ahmed, A., Xing, E. P., et al. (2010). Estimating timevarying networks. The Annals of Applied Statistics, 4(1), 94–123.
Krumsiek, J., Suhre, K., Illig, T., Adamski, J., & Theis, F. J. (2011). Gaussian graphical modeling reconstructs pathway reactions from highthroughput metabolomics data. BMC Systems Biology, 5(1), 21.
Lam, C., & Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B), 4254.
Lauritzen, S. L. (1996). Graphical models. Oxford: Oxford University Press.
Levina, E., Rothman, A., Zhu, J., et al. (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. The Annals of Applied Statistics, 2(1), 245–263.
Liu, H., Han, F., & Zhang, C. (2012). Transelliptical graphical models. In Advances in Neural Information Processing Systems (pp. 809–817).
Liu, H., Lafferty, J., & Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research, 10, 2295–2328.
Liu, H., Wang, L., & Zhao, T. (2014). Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics, 23(2), 439–459.
Liu, S., Quinn, J. A., Gutmann, M. U., & Sugiyama, M. (2013). Direct learning of sparse changes in Markov networks by density ratio estimation. In Joint European conference on machine learning and knowledge discovery in databases (pp. 596–611). Springer.
Ma, S., Gong, Q., & Bohnert, H. J. (2007). An arabidopsis gene network based on the graphical Gaussian model. Genome Research, 17(11), 1614–1625.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1980). Multivariate analysis. London: Academic Press.
McCall, M. N., Uppal, K., Jaffee, H. A., Zilliox, M. J., & Irizarry, R. A. (2011). The gene expression barcode: Leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Research, 39(suppl 1), D1011–D1015.
Meinshausen, N., & Bühlmann, P. (2006). Highdimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3), 1436–1462.
Min, M. R., Ning, X., Cheng, C., & Gerstein, M. (2014). Interpretable sparse highorder Boltzmann machines. In Proceedings of the seventeenth international conference on artificial intelligence and statistics (pp. 614–622).
Mohan, K., London, P., Fazel, M., Lee, S. I., & Witten, D. (2013). Nodebased learning of multiple Gaussian graphical models. arXiv preprint arXiv:1303.5145.
Monti, R. P., Anagnostopoulos, C., & Montana, G. (2015). Learning population and subjectspecific brain connectivity networks via mixed neighborhood selection. arXiv preprint arXiv:1512.01947.
Negahban, S., Yu, B., Wainwright, M. J., & Ravikumar, P. K. (2009). A unified framework for highdimensional analysis of \( m \)estimators with decomposable regularizers. In Advances in Neural Information Processing Systems (pp. 1348–1356).
Ng, B., Varoquaux, G., Poline, J. B., & Thirion, B. (2013). A novel sparse group Gaussian graphical model for functional connectivity estimation. In Information processing in medical imaging (pp. 256–267). Springer.
Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., BroackesCarter, F., et al. (2013). The MIntAct project IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research,. doi:10.1093/nar/gkt1115.
Pang, H., Liu, H., & Vanderbei, R. (2014). The fastclime package for linear programming and largescale precision matrix estimation in R. Journal of Machine Learning Research, 15, 489–493.
Prasad, T. K., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., et al. (2009). Human protein reference database 2009 update. Nucleic Acids Research, 37(suppl 1), D767–D772.
Qiu, H., Han, F., Liu, H., & Caffo, B. (2013). Joint estimation of multiple graphical models from high dimensional time series. arXiv preprint arXiv: 1311.0219.
Ripley, B. D. (2009). Stochastic simulation (Vol. 316). London: Wiley.
Rothman, A. J. (2012). Positive definite estimators of large covariance matrices. Biometrika, 99(3), 733–740.
Rothman, A. J., Bickel, P. J., Levina, E., Zhu, J., et al. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494–515.
Schmidt, M., & Murphy, K. (2010). Convex structure learning in loglinear models: Beyond pairwise potentials. In Proceedings of the international conference on artificial intelligence and statistics (AISTATS).
Stark, C., Breitkreutz, B. J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M. (2006). Biogrid: A general repository for interaction datasets. Nucleic Acids Research, 34(suppl 1), D535–D539.
Sugiyama, M., Kanamori, T., Suzuki, T., du Plessis, M. C., Liu, S., & Takeuchi, I. (2013). Densitydifference estimation. Neural Computation, 25(10), 2734–2775.
Sun, L., Patel, R., Liu, J., Chen, K., Wu, T., Li, J., et al. (2009). Mining brain region connectivity for Alzheimer’s disease study via sparse inverse covariance estimation. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1335–1344). ACM.
The Cancer Genome Atlas Research Network. (2011). Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353), 609–615.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
Wainwright, M. J., & Jordan, M. I. (2006). Logdeterminant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6), 2099–2109.
Yang, E., Lozano, A. C., & Ravikumar, P. K. (2014). Elementary estimators for graphical models. In Advances in neural information processing systems (pp. 2159–2167).
Yuan, M., & Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19–35.
Zhang, B., & Wang, Y. (2012). Learning structural changes of Gaussian graphical models in controlled experiments. arXiv preprint arXiv:1203.3532.
Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, 38(2), 894–942.
Zhang, Y., & Schneider, J. G. (2010). Learning multiple tasks with a sparse matrixnormal penalty. In Advances in neural information processing systems (pp. 2550–2558).
Zhu, Y., Shen, X., & Pan, W. (2014). Structural pursuit over multiple undirected graphs. Journal of the American Statistical Association, 109(508), 1683–1696.
Acknowledgements
This work was supported by the National Science Foundation under NSF CAREER Award No. 1453580. Any Opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.
Author information
Authors and Affiliations
Corresponding authors
Additional information
Editors: Thomas Gärtner, Mirco Nanni, Andrea Passerini, and Celine Robardet.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Wang, B., Singh, R. & Qi, Y. A constrained \(\ell \)1 minimization approach for estimating multiple sparse Gaussian or nonparanormal graphical models. Mach Learn 106, 1381–1417 (2017). https://doi.org/10.1007/s1099401756357
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s1099401756357