1 Introduction

Undirected graphical models (UGMs) provide a powerful tool for understanding statistical relationships among random variables. In a typical setting, we represent the conditional dependency patterns among p random variables \(\{ X_1,\ldots ,X_p \}\) using an undirected graph \(G=(V,E)\). V includes p nodes corresponding to the p variables, and E denotes the set of edges describing conditional dependencies among \(\{X_1, \ldots , X_p\}\). If a pair of random variables is conditionally dependent given the rest of the variables, an edge in E connects the corresponding pair of nodes in V; otherwise, the edge is absent. Within the graphical model framework, the task of estimating such undirected graphs from a set of observed data samples is called structure estimation or model selection. Much of the related literature has focused on estimating G from a given data matrix \(\mathbf {X}_{n\times p}\) (with n observations of p random variables) whose rows are independently and identically drawn from \(N_p(\mu , \varSigma )\). Here \(N_p(\mu , \varSigma )\) represents a multivariate Gaussian distribution with mean vector \(\mu \in \mathbb {R}^p\) and covariance matrix \(\varSigma \) of size \(p \times p\). Using such a graph G to describe pairwise conditional dependencies among the p variables of a multivariate Gaussian distribution is called a Gaussian graphical model (GGM, e.g., Lauritzen 1996; Mardia et al. 1980). The inverse of the covariance matrix is called the precision matrix, \(\varOmega := \varSigma ^{-1}\). The conditional independence pattern of a GGM corresponds to the zeros of \(\varOmega \): no edge connects the ith and jth nodes (i.e., \(X_i\) and \(X_j\) are conditionally independent given the remaining variables) if and only if \(\varOmega _{ij} = 0\).
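To make this zero-pattern correspondence concrete, the following minimal numpy sketch (ours, purely illustrative and not part of the original method) builds a small sparse precision matrix, samples from the implied Gaussian, and checks that inverting the sample covariance approximately recovers the same zero pattern.

```python
import numpy as np

# Precision matrix for p = 4 variables; the only off-diagonal non-zeros are
# (0,1) and (1,2), so the underlying graph is the chain 0 - 1 - 2 with node 3
# isolated.
Omega = np.array([[1.0, 0.4, 0.0, 0.0],
                  [0.4, 1.0, 0.3, 0.0],
                  [0.0, 0.3, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
Sigma = np.linalg.inv(Omega)              # covariance of the corresponding GGM

# Draw many samples from N_p(0, Sigma); with n >> p, inverting the sample
# covariance roughly recovers the zero pattern of Omega.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(4), Sigma, size=100000)
Omega_hat = np.linalg.inv(np.cov(X, rowvar=False))
print(np.round(Omega_hat, 2))             # entries (0,2), (0,3), (1,3), (2,3) ~ 0
```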

To achieve a consistent estimation of G, assumptions are usually imposed on the structure of \(\varOmega \). Most commonly, a graph sparsity assumption has been adopted by various estimators to derive sparse GGMs (sGGM). The graph sparsity assumption limits the number of non-zero entries in the precision matrix \(\varOmega \), which leads to a combinatorial problem for structure estimation. Many classic GGM estimators use the \(\ell _1\)-norm to obtain a convex relaxation of this combinatorial formulation. For instance, the popular “graphical lasso” (GLasso) estimator maximizes an \(\ell _1\)-penalized log-likelihood objective (Yuan and Lin 2007; Banerjee et al. 2008; Hastie et al. 2009; Rothman et al. 2008). More recently, Cai et al. (2011) proposed a constrained \(\ell _1\)-minimization formulation for estimating \(\varOmega \), known as the CLIME estimator. CLIME can be solved through column-wise linear programming and has shown favorable theoretical properties. Moreover, the nonparanormal graphical models (NGM) recently proposed by Liu et al. (2012) extend sGGM to a broader distribution family. Both GGM and NGM belong to the general family of UGMs (reviewed in Sect. 4).

This paper focuses on the problem of jointly estimating K undirected graphical models from K related multivariate sample blocks. Each sample block contains a different set of data observations on the same set of variables. This task is motivated by the fact that the past decade has seen a revolution in collecting large-scale heterogeneous data in many scientific fields like genetics and brain science. For instance, genomic technologies have delivered fast and accurate molecular profiling data across many cellular contexts (e.g., cell lines or cell stages) (ENCODE Project Consortium 2011). Many neuroimaging studies have collected functional measurements of brain regions across a cohort of human subjects (Di Martino et al. 2014). Such studies are often concerned with identifying subject-specific variations across a population, where each individual constitutes a unique context. For this type of data, understanding and quantifying context-specific variations across multiple graphs is a fundamental analysis task. Figure 1 provides a simple illustration (with two contexts) of the target problem. Interaction patterns that are activated only under a specific context can help to understand or to predict the importance of that context (Ideker and Krogan 2012; Kelly et al. 2012).

Prior approaches for estimating UGMs from such heterogeneous data tend to either estimate only pairwise differential patterns between two graphs or jointly estimate multiple sGGMs toward a common graph pattern (reviewed in Sect. 4). The former strategy does not exploit the shared network structure across contexts and is not applicable to more than two contexts, which degrades the quality of the estimated networks. Conversely, the latter approach underestimates the network variability and makes implicit assumptions, which are difficult to justify in practice, to minimize inter-context differences. This is partly caused by the fact that relevant studies have mostly extended the “graphical lasso” (GLasso) estimator to multi-task settings and followed a penalized log-likelihood formulation [Eq. (2)]. Under the GLasso framework, however, explicitly quantifying the context-specific substructures involves a very challenging optimization task (explained in detail in Sect. 4).

This paper proposes a novel approach that uses a constrained \(\ell _1\)-minimization formulation for the joint structure learning of multiple sparse GGMs or NGMs. We name the method SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly) and make the following contributions:

  • Novel model Using a constrained \(\ell _1\) optimization strategy (Sect. 2), SIMULE extends CLIME to a multi-task setting. The learning step is solved efficiently through a formulation of multiple independent linear programming sub-problems (Sect. 2.4), for which we also provide a parallel version of the learning algorithm. Compared with previous multi-task sGGM models, SIMULE can accurately quantify task-specific network variations that are unique to each task. This also leads to a better generalization and benefits all the involved tasks.

  • Novel extension Furthermore, since most real-world datasets do not follow the normality assumption, we extend SIMULE to Nonparanormal SIMULE (NSIMULE in Sect. 3) by learning multiple NGMs under the proposed \(\ell _1\)-constrained minimization. NSIMULE can deal with non-Gaussian data that follow the nonparanormal distribution (Liu et al. 2009), a much more general data distribution family. Fitting NSIMULE is computationally as efficient as fitting SIMULE.

  • Theoretical convergence rate In Sect. 5, we theoretically prove that SIMULE and NSIMULE achieve a consistent estimation of the target (true) dependency graphs with a high probability at the rate \(O(\log (Kp)/n_{tot})\). Here \(n_{tot}\) represents the total number of samples from all tasks and K is the number of tasks. This proof also theoretically validates the benefit of learning multiple sGGMs jointly (Sect. 7), since the \(O(\log (Kp)/n_{tot})\) convergence rate is better than the rate \(O(\log p/n_i)\) obtained by learning multiple single sGGMs separately, where \(n_{i}\) is the number of samples of the ith task. Such an analysis has not been provided in any of the previous multi-sGGM studies.

  • Performance improvement In Sect. 6 we show a strong performance improvement of SIMULE and NSIMULE over multiple baseline methods on multiple synthetic datasets and two real-world multi-cell biomedical datasets. The proposed methods obtain better AUC and partial AUC scores in all simulated cases. On the two real-world datasets, our methods recover the largest number of variable interactions that are matched by existing BioMed databases used for validation.

2 Method: SIMULE

Learning multiple UGMs jointly is a task of interest in many applications. This paper aims to model and learn context-specific graph variations explicitly, because such variations can “fingerprint” important markers in fields like cognition (Ideker and Krogan 2012), physiology (Kelly et al. 2012) or pathology (Ideker and Krogan 2012; Kelly et al. 2012). We consider the general case of estimating K graphical models from a p-dimensional aggregated dataset in the form of K different data blocks.

In what follows, plain letters denote scalars. Uppercase and lowercase bold letters denote matrices and vectors, respectively. We denote \(\mathbf {X}^{(i)}_{n_i \times p}\) as the ith data block (or data matrix). The total number of data samples is \(n_{tot} =\sum \nolimits _{i=1}^{K}n_i\). The precision matrix is denoted \(\varOmega \) and the covariance matrix \(\varSigma \). We denote the correlation matrix as \(\mathbf {S}\) and the inverse of the correlation matrix as \(\varPhi \). The vector \(\mathbf {e}_j = (0,\ldots ,1,\ldots ,0)^T\) denotes the basis vector in which only the jth entry is 1 and the rest are 0. For a p-dimensional vector \(\mathbf {x}= (x_1, x_2, \ldots , x_p)^T \in \mathbb {R}^p\), let \(||\mathbf {x}||_1 = \sum \nolimits _i|x_i|\) be the \(\ell _1\)-norm of \(\mathbf {x}\), \(||\mathbf {x}||_{\infty } = \max \nolimits _i|x_i|\) be the \(\ell _{\infty }\)-norm of \(\mathbf {x}\) and \(||\mathbf {x}||_2 = \sqrt{\sum \nolimits _i x_i^2}\) be the \(\ell _2\)-norm of \(\mathbf {x}\). Similarly, for a matrix \(\mathbf {X}\), let \(||\mathbf {X}||_1 = \sum \nolimits _{i,j}|\mathbf {X}_{i,j}|\) be the entrywise \(\ell _1\)-norm of \(\mathbf {X}\) and \(||\mathbf {X}||_{\infty } = \max \nolimits _{i,j}|\mathbf {X}_{i,j}|\) be the entrywise \(\ell _{\infty }\)-norm of \(\mathbf {X}\). \(||\mathbf {X}||_2 = \sqrt{\lambda _{\max }(\mathbf {X}^T\mathbf {X})}\) is the spectral norm, where \(\lambda _{\max }\) denotes the largest eigenvalue. \(||\mathbf {X}||_F = \sqrt{\sum \nolimits _{i,j}\mathbf {X}_{i,j}^2}\) is the Frobenius norm of \(\mathbf {X}\). \(||\mathbf {X}||_\mathbf{1 } = \max \nolimits _{j} \sum \nolimits _i |\mathbf {X}_{ij}|\) is the matrix 1-norm of \(\mathbf {X}\). \(||\mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)}||_{1,p} = \sum \nolimits _{i}||\mathbf {X}^{(i)}||_p\) is the \(\ell _{1,p}\)-norm of \((\mathbf {X}^{(1)}, \mathbf {X}^{(2)}, \ldots , \mathbf {X}^{(K)})\). \(\varOmega \succ 0\) means that \(\varOmega \) is a positive definite matrix. \(||(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)})||_{\mathcal {G},2} = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p ||(\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(K)})||_2\) and \(||(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots , \varOmega ^{(K)})||_{\mathcal {G},\infty } = \sum \nolimits _{j = 1}^p \sum \nolimits _{k = 1}^p ||(\varOmega _{j,k}^{(1)}, \varOmega _{j,k}^{(2)}, \ldots , \varOmega _{j,k}^{(K)})||_\infty \).
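For concreteness, the following short numpy sketch (ours, purely illustrative) computes the matrix norms defined above, including the group norms \(||\cdot ||_{\mathcal {G},2}\) and \(||\cdot ||_{\mathcal {G},\infty }\) over a list of K matrices.

```python
import numpy as np

# Illustrative implementations (ours) of the norms used in this paper.
def l1(X):   return np.abs(X).sum()               # entrywise ||X||_1
def linf(X): return np.abs(X).max()               # entrywise ||X||_inf
def frob(X): return np.sqrt((X ** 2).sum())       # Frobenius norm ||X||_F
def mat1(X): return np.abs(X).sum(axis=0).max()   # matrix 1-norm (max column sum)

def group_norm(Omegas, ord):
    """||(Omega^(1),...,Omega^(K))||_{G,ord}: for each entry (j,k), take the
    ord-norm of the K-vector of values, then sum over all (j,k)."""
    stacked = np.stack(Omegas, axis=-1)           # shape (p, p, K)
    return np.linalg.norm(stacked, ord=ord, axis=-1).sum()

rng = np.random.default_rng(1)
Omegas = [rng.standard_normal((5, 5)) for _ in range(3)]     # K = 3, p = 5
print(group_norm(Omegas, 2), group_norm(Omegas, np.inf))
```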

2.1 Background: CLIME for estimating sparse Gaussian graphical model

The CLIME estimator, short for constrained \(\ell _1\) minimization method for inverse matrix estimation (Cai et al. 2011), uses an \(\ell _1\) constrained optimization [as Eq. (1)] to estimate the precision matrix \(\varOmega \):

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{\varOmega }}\,||\varOmega ||_1 ,\quad \text { subject to:} \,\, ||\varSigma \varOmega - I||_{\infty } \le \lambda \end{aligned}$$
(1)

Here \(\lambda > 0\) is the tuning parameter.

The idea comes from taking the first derivative of the objective function of “graphical Lasso” (GLasso) (Yuan and Lin 2007; Friedman et al. 2008). The original objective function (an \(\ell _1\)-penalized log-likelihood) is as follows,

$$\begin{aligned} \widehat{\varOmega }_{glasso} = {\mathop {\hbox {argmin}}\limits _{\varOmega }} \{ -\log \det (\varOmega ) + \langle \varOmega ,\varSigma \rangle + \lambda ||\varOmega ||_1 \} \end{aligned}$$
(2)

By taking the first derivative of Eq. (2) and setting it equal to zero, the solution \(\widehat{\varOmega }_{glasso}\) also satisfies:

$$\begin{aligned} \widehat{\varOmega }_{glasso}^{-1} - \hat{\varSigma } = \lambda \widehat{Z} \end{aligned}$$
(3)

where \(\widehat{Z}\) is an element of the subdifferential \( \partial ||\widehat{\varOmega }_{glasso}||_1\). This leads to the CLIME estimator in Eq. (1). A key property of CLIME is that it can be solved column by column. Suppose \(\beta \) is one of the column vectors of the precision matrix \(\varOmega \). Instead of estimating the entire \(\varOmega \) at once as in Eq. (1), we can estimate each column \(\beta \) of \(\varOmega \) as follows:

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{\beta }}||\beta ||_1 \quad \hbox {subject to } ||\varSigma \beta -e_j||_{\infty } \le \lambda \end{aligned}$$

After we obtain an estimate \(\widehat{\varOmega }^{1}\) from Eq. (1), to maintain the symmetry of the estimator, CLIME applies the following operation,

$$\begin{aligned} \widehat{\varOmega }_{ij} = \widehat{\varOmega }_{ji} = \widehat{\varOmega }^{1}_{ij}\mathop {\mathrm {sign}}\left( \max \left( \left| \widehat{\varOmega }^{1}_{ij}\right| - \left| \widehat{\varOmega }^{1}_{ji}\right| ,0\right) \right) + \widehat{\varOmega }^{1}_{ji} \mathop {\mathrm {sign}}\left( \max \left( \left| \widehat{\varOmega }^{1}_{ji}\right| - \left| \widehat{\varOmega }^{1}_{ij}\right| , 0\right) \right) \end{aligned}$$

Cai et al. (2011) have proved that CLIME achieves a consistent estimation of the true graph with a high probability at the rate \(O(\log p/n)\).
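As an illustration (ours, not from the paper), the column-wise CLIME problem above can be solved with an off-the-shelf linear programming routine by splitting \(\beta \) into its positive and negative parts. The sketch below uses scipy.optimize.linprog, plus a symmetrization step that keeps, for each entry pair, the value of smaller magnitude (the rule used in Cai et al. 2011).

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(Sigma_hat, j, lam):
    """Sketch (ours) of the column-wise CLIME problem
       argmin ||beta||_1  s.t.  ||Sigma_hat @ beta - e_j||_inf <= lam,
    solved as an LP with the split beta = u - v, where u, v >= 0."""
    p = Sigma_hat.shape[0]
    e_j = np.zeros(p); e_j[j] = 1.0
    c = np.ones(2 * p)                      # sum(u) + sum(v) equals ||beta||_1
    A_ub = np.block([[Sigma_hat, -Sigma_hat],
                     [-Sigma_hat, Sigma_hat]])
    b_ub = np.concatenate([lam + e_j, lam - e_j])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]            # beta = u - v

def symmetrize(Omega1):
    """Keep, for each (i, j), the entry of smaller magnitude (Cai et al. 2011)."""
    keep = np.abs(Omega1) <= np.abs(Omega1.T)
    return np.where(keep, Omega1, Omega1.T)

# Usage on a toy sample covariance (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Sigma_hat = np.cov(X, rowvar=False)
Omega1 = np.column_stack([clime_column(Sigma_hat, j, lam=0.3) for j in range(10)])
Omega_hat = symmetrize(Omega1)
```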

2.2 Background: multi-task learning with task-shared and task-specific parameters

Multi-task learning (MTL) was initially proposed by Caruana (1997) to find common feature weights across multiple related tasks. If different tasks are sufficiently related, MTL can lead to a better generalization across all tasks. The specific MTL formulation we explore is suggested by MT-SVM (Evgeniou and Pontil 2004), which models multiple support vector machines (SVMs) through task-shared and task-specific parameters. Given K related tasks in which each sample \((\varvec{x}_i, y_i)\) belongs to exactly one of the tasks \(\{1,\ldots ,K\}\), MT-SVM (Evgeniou and Pontil 2004) learns K distinct parameter vectors \(\varvec{w}_1,\ldots ,\varvec{w}_K\), where each \(\varvec{w}_k\) is dedicated to task k. In addition, MT-SVM uses a global parameter \(\varvec{w}_0\) to capture the commonality among all the tasks. Each sample \(\varvec{x}_i\) belonging to task k is classified by \(\hat{y}_i = \text {sign} (\varvec{x}^T_i ( \varvec{w}_0 + \varvec{w}_k))\). Clearly, the vectors \(\varvec{w}_k\) are “small” when the tasks are similar to each other. In other words, we assume that the tasks are related in such a way that the true models are all close to some shared model \(\varvec{w}_0\). MT-SVM estimates \(\varvec{w}_1,\ldots ,\varvec{w}_K\) and \(\varvec{w}_0\) simultaneously.

Fig. 1 A simple example of the target problem. We are trying to learn (1) individual parts that are specific to each task itself (the two lower-left subfigures) and (2), at the same time, the shared part that is common to all tasks (the lower-right subfigure)

2.3 SIMULE: infer Shared and Individual parts of MULtiple sGGM Explicitly

Treating sparse GGM estimation from each data block as a single task, our main goal is to learn multiple sGGMs over K tasks jointly, which can lead to a better generalization across all of the involved tasks (theoretically proven in Sect. 5).

Mirroring the strategy used by previous joint estimators of multiple GLasso models (described in Sect. 4), we achieve “multi-tasking” by summing up the CLIME objectives of the K tasks:

$$\begin{aligned}&\hat{\varOmega }^{(i)}={\mathop {\hbox {argmin}}\limits _{\varOmega ^{(i)}}}\sum \limits _i ||\varOmega ^{(i)}||_1\nonumber \\&\text {Subject to: } \; ||\varSigma ^{(i)}\varOmega ^{(i)} - I||_{\infty } \le \lambda _{n_{tot}}, \quad i = 1,\ldots ,K \end{aligned}$$
(4)

Here \(n_{tot} =\sum \nolimits _{i=1}^{K}n_i\) represents the total number of data samples used in all tasks. It is worth mentioning that we select the hyperparameter by considering the sample size, i.e., using \(\lambda _{n_{tot}}\) instead of \(\lambda \) (based on two recent studies (Negahban et al. 2009; Yang et al. 2014), details in Sect. 6). To simplify notations, we use \(\lambda _n\) instead of \(\lambda _{n_{tot}}\) in the rest of this section.

Then following the MTL formulation of MT-SVM, we simply model each sGGM network as

$$\begin{aligned} \varOmega ^{(i)} = \varOmega ^{(i)}_I + \varOmega _S, \end{aligned}$$
(5)

where \(\varOmega _S\) is the shared pattern among all graphs and \(\varOmega _I^{(i)}\) represents the individual part specific to the ith graph (see Fig. 1 for a simple illustration). Furthermore, we assume that both the shared and the individual parts of each \(\varOmega ^{(i)}\) should be sparse and encode this assumption in the following penalty,

$$\begin{aligned} \left| \left| \varOmega ^{(i)}_I\right| \right| _1+ \varepsilon ||\varOmega _S||_1 \end{aligned}$$
(6)

Here we assume \(\varepsilon > 0\). This hyper-parameter is chosen according to the application. It reflects the difference in sparsity level between the shared part and the context-specific parts. For example, due to evolutionary stability and/or system stability, we usually assume that the shared part of gene networks is denser than the individual subgraphs of each cell context. Accordingly, Sect. 6 uses \(\varepsilon < 1\) for the two real-world molecular expression datasets.

Now we replace the objective in  Eq. (4) with the new objective from Eq. (6) and plug  Eq. (5) into the constraints of  Eq. (4). This gives us the following novel formulation of SIMULE:

$$\begin{aligned}&\hat{\varOmega }^{(1)}_I, \hat{\varOmega }^{(2)}_I, \ldots , \hat{\varOmega }^{(K)}_I, \hat{\varOmega }_S = {\mathop {\hbox {argmin}}\limits _{\varOmega ^{(i)}_I,\varOmega _S}}\sum \limits _i \left| \left| \varOmega ^{(i)}_I\right| \right| _1+ \varepsilon K||\varOmega _S||_1 \nonumber \\&\text {Subject to:} \; ||\varSigma ^{(i)}(\varOmega ^{(i)}_I + \varOmega _S) - I||_{\infty } \le \lambda _{n}, \quad i = 1,\ldots ,K \end{aligned}$$
(7)

In Sect. 5, we theoretically prove that the \(\{ \varOmega ^{(i)}|i = 1,\ldots ,K\}\) estimated from Eq. (7) are positive definite and converge to the true precision matrices with a high probability.

Equation (7) can be solved column by column without changing the resulting solution. For each column, we solve the following optimization:

$$\begin{aligned}&{\mathop {\hbox {argmin}}\limits _{\beta ^{(i)},{\beta ^s}}}\sum \limits _i||\beta ^{(i)}||_1+ \varepsilon K||{\beta ^s}||_1 \nonumber \\&\text {Subject to:} \; ||\varSigma ^{(i)}(\beta ^{(i)}+{\beta ^s})-e_j||_{\infty }\le \lambda _{n},\quad i = 1,\ldots ,K \end{aligned}$$
(8)

\(\beta ^{(i)}\) is one of the column vectors of the individual part \(\varOmega ^{(i)}_I\) of the ith graph (we drop the subscript I from \(\beta ^{(i)}\) to simplify notation). \({\beta ^s}\) is the corresponding column of the shared part \(\varOmega _S\).

Equivalently, SIMULE solves the following optimization, for each column:

$$\begin{aligned}&{\mathop {\hbox {argmin}}\limits _{\beta ^{(i)},{\beta ^s}}}\,\left| \left| \left[ \beta ^{(1)}_{p \times 1}, \ldots ,\beta ^{(i)}_{p \times 1},\ldots , \beta ^{(K)}_{p \times 1} ,\varepsilon K{\beta ^s}_{p \times 1}\right] _{p \times (K + 1)}\right| \right| _1, \nonumber \\&\text {Subject to:} \nonumber \\&\quad ||[0,\ldots ,0,\varSigma ^{(i)},0,\ldots ,0,\varSigma ^{(i)}]_{p \times (K+1)p}\nonumber \\&\quad [\beta ^{(1)^T}, \ldots , \beta ^{(i)^T},\ldots , \beta ^{(K)^T}, {(\beta ^s)}^T]_{(K+1)p \times 1}^T\nonumber \\&\quad -\,e_j||_{\infty }\le \lambda _{n}, \quad i= 1,\ldots ,K. \end{aligned}$$
(9)

2.4 Optimization

With simplified notation, Eq. (9) can be rewritten as

$$\begin{aligned}&{\mathop {\hbox {argmin}}_{\theta }}\,||\theta ||_1\nonumber \\&\quad \text {Subject to:}\; ||\mathbf {A}^{(i)}\theta - \varvec{b}||_{\infty } \le c, \; i = 1,\ldots , K \nonumber \\&\quad \quad \text {Where} \; \mathbf {A}^{(i)} = \left[ 0,\ldots ,0,\varSigma ^{(i)},0,\ldots ,0,\frac{1}{\varepsilon K}\varSigma ^{(i)}\right] , \nonumber \\&\quad \quad \theta = \left[ \beta ^{(1)^T}, \ldots , \beta ^{(i)^T},\ldots ,\beta ^{(K)^T}, \varepsilon K{(\beta ^s)}^T\right] ^T, \varvec{b}= \mathbf {e}_j, c = \lambda _{n} \end{aligned}$$
(10)
Algorithm 1 Pseudo code of SIMULE

Through relaxation, we convert Eq. (10) to the following linear programming formulation in Eq. (11),

$$\begin{aligned}&{\mathop {\hbox {argmin}}\limits _{\mathbf {u}_j,\theta }} \sum \limits _{j=1}^{(K+1)p} \mathbf {u}_j\nonumber \\&\hbox {subject to:}\nonumber \\&\quad -\theta _j \le \mathbf {u}_j ,\quad j=1,\ldots , (K+1)p\nonumber \\&\quad \theta _j \le \mathbf {u}_j,\quad j =1,\ldots , (K+1)p\nonumber \\&\quad -\mathbf {A}_{k,}^{(i)^T}\theta + \varvec{b}_k \le c,\quad k = 1,\ldots ,p , i = 1,\ldots ,K\nonumber \\&\quad \mathbf {A}_{k,}^{(i)^T}\theta - \varvec{b}_k \le c,\quad k = 1,\ldots ,p, i = 1,\ldots , K \end{aligned}$$
(11)

Here the \(\mathbf {u}_j\) are slack variables, \(\mathbf {A}_{k,}^{(i)}\) denotes the kth row of \(\mathbf {A}^{(i)}\) and \(\varvec{b}_k\) is the kth entry of \(\varvec{b}\). The pseudo code of SIMULE is summarized in Algorithm 1. Following CLIME, we then apply the same symmetrization operation to the \(\{\varOmega ^{(i)} = \varOmega _S + \varOmega _I^{(i)} \}\) obtained from Algorithm 1. Section 5 proves that each \(\varOmega ^{(i)}\) converges to the true precision matrix at an optimal convergence rate.

To solve Eq. (11), we follow the primal-dual interior-point method (Boyd and Vandenberghe 2004) that has also been used in the Dantzig selector for regression (Candes and Tao 2007). Other strategies can be used to solve this linear program, such as the one used in Pang et al. (2014).
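For concreteness, the following Python sketch (ours; it uses scipy.optimize.linprog rather than the interior-point code referenced above) solves the per-column problem of Eq. (10) via the standard positive/negative split of \(\theta \), which is equivalent to the slack-variable form of Eq. (11).

```python
import numpy as np
from scipy.optimize import linprog

def simule_column(Sigmas, j, lam, eps):
    """Sketch (ours) of the per-column SIMULE problem of Eq. (10):
       argmin ||theta||_1  s.t.  ||A^(i) theta - e_j||_inf <= lam, i = 1..K,
    where A^(i) = [0,..,Sigma^(i),..,0, (1/(eps*K)) Sigma^(i)] and
    theta = [beta^(1); ...; beta^(K); eps*K*beta^s]."""
    K, p = len(Sigmas), Sigmas[0].shape[0]
    d = (K + 1) * p                              # length of theta
    e_j = np.zeros(p); e_j[j] = 1.0

    # For theta = u - v with u, v >= 0, stack the 2*K*p inequality rows
    # A^(i) (u - v) <= lam + e_j and -A^(i) (u - v) <= lam - e_j.
    A_rows, b_rows = [], []
    for i, S in enumerate(Sigmas):
        A_i = np.zeros((p, d))
        A_i[:, i * p:(i + 1) * p] = S            # block acting on beta^(i)
        A_i[:, K * p:] = S / (eps * K)           # block acting on eps*K*beta^s
        A_rows += [np.hstack([A_i, -A_i]), np.hstack([-A_i, A_i])]
        b_rows += [lam + e_j, lam - e_j]
    res = linprog(np.ones(2 * d), A_ub=np.vstack(A_rows),
                  b_ub=np.concatenate(b_rows), bounds=(0, None), method="highs")
    theta = res.x[:d] - res.x[d:]
    betas_I = [theta[i * p:(i + 1) * p] for i in range(K)]   # individual parts
    beta_s = theta[K * p:] / (eps * K)                       # shared part
    return betas_I, beta_s
```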

2.5 Parallel version of SIMULE

Algorithm 1 can easily be revised into a parallel version. Essentially we just need to change the “for” loop at step 8 of Algorithm 1 to assign, for instance, one column per machine or one column per core. Since the calculation of each column is independent of the other columns, this parallel variation obtains the same solution as Algorithm 1 at a better speed. Section 6 shows the speed improvement of SIMULE in a multi-core setting versus a single-core setting.
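A minimal sketch of this “column per core” idea (ours, reusing the hypothetical simule_column helper from the previous snippet) could look as follows.

```python
from concurrent.futures import ProcessPoolExecutor

def simule_parallel(Sigmas, lam, eps, n_workers=4):
    """Solve the p independent column problems in parallel; the result is
    identical to the serial loop because the columns do not interact."""
    p = Sigmas[0].shape[0]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        cols = list(pool.map(simule_column,
                             [Sigmas] * p, range(p), [lam] * p, [eps] * p))
    return cols        # cols[j] = (betas_I, beta_s) for the jth column
```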

3 Method variation: nonparanormal SIMULE (NSIMULE)

Though sGGM is powerful, its normality assumption is commonly violated in real applications. For instance, for the TF ChIP-Seq data analyzed in Sect. 6.6, the histogram of one of its TF variables clearly does not follow a Gaussian distribution (across samples, shown as the right distribution graph in Fig. 2). After a univariate log-transformation of the same feature, we obtain the distribution histogram shown as the left graph in Fig. 2. The transformed data samples are approximately normally distributed. This motivates us to adopt a more general UGM (recently proposed in Liu et al. 2009) to overcome the limitation of sGGM. This so-called “nonparanormal graphical model” (Liu et al. 2009) assumes that data samples follow a multivariate nonparanormal distribution, which is a strict superset of the Gaussian distribution. We extend SIMULE to the nonparanormal family and name this novel variation NSIMULE. NSIMULE learns to fit multiple NGMs jointly through modeling task-shared and task-specific parameters explicitly.

3.1 Background: nonparanormal graphical model

Fig. 2 A simple example showing a nonparanormal graphical model and its unobserved latent Gaussian graphical model. The leftmost sub-figure shows \(X \sim N_p(\mu ,\mathbf {S})\). The rightmost sub-figure shows \(Z \sim NPN_{p}(\mu ,\mathbf {S};f_{1},\ldots ,f_{p})\). The right distribution graph shows the histogram of one feature \(\mathbf {z}_i\) (one TF variable, data details in Sect. 6.6) from a real TF binding dataset. The left histogram graph shows the distribution of a log-transformation of the same feature (\(\mathbf {z}_i\)). Because the left histogram roughly follows a Gaussian distribution, this indicates that \(\mathbf {z}_i\) follows a nonparanormal distribution. This shows the need to extend SIMULE to the nonparanormal distribution, a strict superset of the Gaussian distribution

A random vector \(Z = (Z_{1}, \ldots , Z_{p})^{T}\) is said to follow a nonparanormal distribution

$$\begin{aligned} Z \sim NPN_{p}(\mu ,\mathbf {S};f_{1},\ldots ,f_{p}) \end{aligned}$$

if and only if there exists a set of univariate strictly increasing transformations \(f = \{f_{j}\}_{j=1}^{p}\) such that:

$$\begin{aligned} f(Z) = (f_{1}(Z_{1}), \ldots , f_{p}(Z_{p}))^{T} := X \sim N(\mu ,\mathbf {S}) \end{aligned}$$

Figure 2 shows a simple example of NGM (inside the rightmost sub-figure) and its unobserved latent Gaussian graphical model (inside the leftmost sub-figure). Assume that we are given a dataset including n observations that are independently and identically drawn from \(NPN_{p}(\mu ,\mathbf {S};f_{1},\ldots ,f_{p})\), a multivariate nonparanormal distribution. The conditional independence graph G among \(Z_i\) variables can be modeled with a corresponding NGM (Liu et al. 2009). The graph structure of NGM is encoded through the sparsity pattern of the inverse covariance matrix \(\varPhi := \mathbf {S}^{-1}\), where \(\mathbf {S}\) denotes the covariance matrix of \(NPN_{p}\).
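To illustrate the definition (our example, not from the paper), one can generate nonparanormal samples by pushing latent Gaussian draws through strictly increasing maps; the graph structure is still encoded by the zeros of \(\mathbf {S}^{-1}\).

```python
import numpy as np

# Sketch (ours): Z ~ NPN_p(0, S; f_1,...,f_p) obtained by applying strictly
# increasing maps g_j = f_j^{-1} to latent Gaussian samples X ~ N_p(0, S).
rng = np.random.default_rng(0)
p, n = 5, 1000
S = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))        # latent correlation matrix
X = rng.multivariate_normal(np.zeros(p), S, size=n)
g = [np.exp, np.cbrt, np.tanh, np.exp, np.cbrt]    # strictly increasing maps
Z = np.column_stack([g[j](X[:, j]) for j in range(p)])
# The conditional independence graph of Z is the one encoded by zeros of S^{-1}.
```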

3.2 Background: estimate \(\mathbf {S}\) through rank-based measures of correlation matrix \(\mathbf {S}_0\)

Since the direct estimation of the covariance matrix \(\mathbf {S}\) is difficult under a nonparanormal distribution, recent studies have proposed an efficient nonparametric estimator (Liu et al. 2009) for \(\mathbf {S}\). This estimator is derived from the correlation matrix \(\mathbf {S}_0\). Because the covariance matrix \(\mathbf {S}= diag(\mathbf {S}_i) \mathbf {S}_0 diag(\mathbf {S}_i)\), we have \(\mathbf {S}^{-1} = diag(\mathbf {S}_i)^{-1} \mathbf {S}_0^{-1} diag(\mathbf {S}_i)^{-1}\). Here \(\mathbf {S}_i = \sqrt{Cov(Z_i,Z_i)}\) and \(diag(\mathbf {S}_i) = diag(\mathbf {S}_1,\mathbf {S}_2,\ldots ,\mathbf {S}_p)\). Therefore, the inverse of the correlation matrix (\(\mathbf {S}_0^{-1}\)) and the inverse of the covariance matrix (\(\mathbf {S}^{-1}\)) have the same pattern of zero and nonzero entries. Based on this observation, Liu et al. (2009) proposed a nonparametric method to estimate the correlation matrix \(\mathbf {S}_0\), instead of the covariance matrix \(\mathbf {S}\), for the purpose of structure inference.

In Liu et al. (2009) the authors proposed using the population Kendall’s tau correlation coefficients \(\tau _{jk}\) to estimate \(\mathbf {S}_0\), based upon the explicit relationship between this rank-based measure \(\tau _{jk}\) and the correlation entry \((\mathbf {S}_0)_{jk}\) for a given nonparanormal dataset \(Z \sim NPN_{p}(\mu , \mathbf {S};f_{1},\ldots ,f_{p})\) (discussed in Liu et al. 2012). Figure 2 presents the simple relationship between \(Z \sim NPN_{p}(\mu ,\mathbf {S};f_{1},\ldots ,f_{p})\) and its latent \(X \sim N_p(\mu ,\mathbf {S})\). To simplify notation, we use \(\mathbf {S}\) to represent the correlation matrix for the remainder of this paper.

Theorem 1

Given \(Z \sim NPN_{p}(\mu , \mathbf {S};f_{1},\ldots ,f_{p})\), a nonparanormal distribution, we have that

$$\begin{aligned} \mathbf {S}_{jk} = \sin \left( \frac{\pi }{2}\tau (Z_{j},Z_{k})\right) . \end{aligned}$$
(12)

where the Kendall’s tau can be estimated as:

$$\begin{aligned} \hat{\tau }_{jk} = \frac{2}{n(n-1)} \sum \limits _{1 \le i < i' \le n} \mathop {\mathrm {sign}}\left( \left( \mathbf {z}_j^i - \mathbf {z}_j^{i'}\right) \left( \mathbf {z}_k^{i} - \mathbf {z}_k^{i'}\right) \right) \end{aligned}$$

Proof

The proof is provided in Liu et al. (2009). \(\square \)

Therefore, the correlation matrix \(\mathbf {S}\) can be estimated as:

$$\begin{aligned} \widehat{\mathbf {S}}_{jk} = \sin \left( \frac{\pi }{2}\hat{\tau }_{jk}\right) . \end{aligned}$$

We can then plug in the estimated \(\widehat{\mathbf {S}}\) for learning the dependency graph structure in the corresponding NGM.
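A short sketch (ours) of this rank-based estimator, using scipy's kendalltau as a stand-in for \(\hat{\tau }_{jk}\) (scipy computes the tie-corrected tau-b, which coincides with the formula above when there are no ties):

```python
import numpy as np
from scipy.stats import kendalltau

def nonparanormal_correlation(Z):
    """Rank-based correlation estimate of Eq. (12):
    S_hat[j, k] = sin(pi/2 * tau_hat[j, k])."""
    n, p = Z.shape
    S_hat = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau, _ = kendalltau(Z[:, j], Z[:, k])
            S_hat[j, k] = S_hat[k, j] = np.sin(np.pi / 2 * tau)
    return S_hat

# Usage: plug S_hat into the SIMULE constraints in place of the sample covariance.
```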

3.3 NSIMULE: nonparanormal SIMULE

We can now substitute each sample covariance matrix \(\hat{\varSigma }^{(i)}\) used in Eq. (9) with the corresponding rank-based correlation matrix \(\widehat{\mathbf {S}}^{(i)}\) estimated as above. The rest of the computation is the same as in SIMULE. We refer to this whole process as NSIMULE. It estimates multiple different, but related, sparse nonparanormal graphical models (sNGM) through shared and task-specific parameter representations.

Theorem 2

If X, Y are two independent random variables and \(f,g:\mathbb {R}\rightarrow \mathbb {R}\) are two measurable functions, then f(X) and g(Y) are also independent.

By the above theorem, the monotone functions f in \(NPN_{p}\) do not change the conditional dependency structure among the variables. As proved in Liu et al. (2009), the conditional dependency network among the latent Gaussian variables X (in this \(NPN_{p}\)) is the same as the conditional dependency network among the nonparanormal variables \(Z_i\), with a parametric asymptotic convergence rate. Therefore, we can use the estimated correlation matrices \(\mathbf {S}^{(i)}\) for the joint network inference of multiple sNGMs in SIMULE. This is also because, as shown above, the inverse of the correlation matrix and the inverse of the covariance matrix share the same pattern of zero and nonzero entries.

4 Related work

4.1 Connecting to past multi-sGGM studies

Sparse GGM is an extremely active topic in the recent literature, including notable studies like Wainwright and Jordan (2006) and Banerjee et al. (2008). We can categorize single-task sGGM estimators into three groups: (a) penalized likelihood (GLasso), (b) neighborhood selection and (c) the CLIME estimator.

Most previous methods that estimate multiple sGGMs jointly (on the same set of variables from aggregated data samples) can be formulated as:

$$\begin{aligned}&{\mathop {\hbox {argmin}}\limits _{\varOmega ^{(i)} \succ 0}} \sum \limits _i \left( -L\left( \varOmega ^{(i)}\right) \right) + \lambda _1 \sum \limits _i ||\varOmega ^{(i)}||_1 \nonumber \\&\quad + \lambda _2 P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)}) \end{aligned}$$
(13)

where \(\varOmega ^{(i)}\) denotes the precision matrix for the ith task. \(L(\cdot )\) represents log-likelihood or pseudo-likelihood function. \(\sum \nolimits _i ||\varOmega ^{(i)}||_1\) adds sparsity constraints on each task. \( P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\) enforces certain joint properties among the tasks (like using group sparsity to enforce similarity among tasks).

We choose the three most relevant studies as our baselines in the experiments: (a) Fused Joint graphical lasso (JGL-fused) (Danaher et al. 2013), (b) Group Joint graphical lasso (JGL-group) (Danaher et al. 2013) and (c) SIMONE (Chiquet et al. 2011). JGL-fused and JGL-group are based on the popular “graphical lasso” estimator (Friedman et al. 2008; Yuan and Lin 2007) [using \(L(\varOmega ) =\log \det (\varOmega ) - \langle \varSigma , \varOmega \rangle \) in Eq. (13)]. SIMONE (Chiquet et al. 2011) follows a neighborhood-selection-based estimator. It can be viewed as using a pseudo-likelihood approximation instead of the full likelihood as \(L(\varOmega )\) in Eq. (13).

The general purpose of the penalty function \( P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\) in Eq. (13) is to push the inference of multiple graphs toward a common pattern. Table 1 provides a representative list of penalty functions that have previously been used in the multi-sGGM setting. For example, JGL uses the fused norm (the 1st row of Table 1) to penalize the difference between two graphs; the resulting variation is named JGL-fused. JGL-group uses a \(\{ \mathcal {G},2 \}\) norm that pushes multiple graphs to have the same sparsity patterns (the 2nd row of Table 1). SIMONE provides a novel penalty proposed by its authors (shown in the 3rd row of Table 1) to enforce a similar sparsity pattern on multiple graphs.

Table 1 A list of representative multi-sGGM methods and the second penalty function each has used

In addition to these three baselines, a number of recent studies also perform multi-task learning of sGGMs (Honorio and Samaras 2010; Guo et al. 2011; Zhang and Wang 2012; Zhang and Schneider 2010; Zhu et al. 2014). They all follow the same formulation as Eq. (13) but explore a different second penalty function \(P(\varOmega ^{(1)}, \varOmega ^{(2)}, \ldots ,\varOmega ^{(K)})\). As an example, Node-based JGL proposed a novel penalty, RCON or “row-column overlap norm” (Mohan et al. 2013) (shown in the 4th row of Table 1), for capturing special relationships among graphs. In two recent works, the penalty function in the 5th row of Table 1 has been used by Guo et al. (2011) and the penalty function in the 6th row of Table 1 has been used by Honorio and Samaras (2010).

Furthermore, there exist studies that explore motivations similar to ours when learning multiple GGM models from data. (a) Han et al. (2013) proposed to estimate a population graph from multi-block data using a so-called “median-graph” idea. It is conceptually similar to \(\varOmega _S\). However, they do not have \(\varOmega ^{(i)}_I\) to model individual parts that are specific to each task. (b) Another recent study, CSSL-GGM (Hara and Washio 2013), also tried to model both the shared and individual substructures in multi-sGGMs. Different from ours, their formulation is within the penalized likelihood framework of Eq. (13). They used the \(\ell _{1,p}\) norm (see the last row of Table 1) to regularize the task-specific parts, while SIMULE uses the \(\ell _1\) norm instead in Eq. (6). The \(\ell _{1,p}\) norm pushes the individual parts of multiple graphs to be similar, which contradicts the original purpose of these parameters. (c) More recently, Monti et al. (2015) proposed to learn population and subject-specific brain connectivity networks via a so-called “Mixed Neighborhood Selection” (MNS) method. Following the neighborhood selection framework (Meinshausen and Bühlmann 2006), MNS learns the neighborhood of each node v. Similar to SIMULE, it estimates the neighborhood edges of a given node v in the ith task as \(\beta ^v + \widetilde{\varvec{b}}^{(i),v}\). Here \(\beta ^v\) represents the neighbors in the shared part and \(\widetilde{\varvec{b}}^{(i),v}\) represents the neighbors that are specific to the ith graph. Since MNS is specially designed for brain imaging data, it assumes each individual graph is generated by random effects, i.e., \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\). SIMULE does not impose such strong assumptions on either task-specific or task-shared substructures; our model is more general, while MNS is designed for brain imaging data. (d) Another line of related studies (Liu et al. 2013; Sugiyama et al. 2013; Fazayeli and Banerjee 2016) proposed density-ratio based strategies to estimate a differential graph between two graphs. Even though this group of methods can handle unbalanced datasets (i.e., the numbers of samples in the two datasets are quite different), they can only capture the difference between two graphs (\(K=2\)). SIMULE does not have such a limitation on the number of tasks. (e) Moreover, several loosely related studies exist in settings different from ours. For example, for handling high-dimensional time series data, a few recent papers have considered exploring multiple sGGMs by modeling relationships among networks; e.g., Kolar et al. (2010), Qiu et al. (2013).

4.2 Biasing covariance matrices with SIMONE-I and SIMULE-I

The SIMONE package (Chiquet et al. 2011) has introduced the “intertwined Lasso” (referred to as “SIMONE-I” in the rest of this paper), which takes the perspective of sharing information among covariance matrices. More specifically, this variation averages each task’s sample covariance matrix with a global empirical covariance matrix obtained from the whole dataset. Motivated by SIMONE-I, we extend SIMULE with a similar strategy that revises each task’s sample covariance matrix to \(\tilde{ \varSigma }^{(i)} = \alpha \varSigma ^{(i)} + (1-\alpha ) n_{tot}^{-1}\sum _{t=1}^K {n_t\varSigma ^{(t)}} \), where \(\alpha = 0.5\). This variation is referred to as “SIMULE-I” for the rest of this paper. We report experimental results from both SIMONE-I and SIMULE-I in Sect. 6.
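As a small illustration (ours), the covariance biasing used by SIMULE-I amounts to the following blend of per-task and pooled sample covariances.

```python
import numpy as np

def intertwine(Sigmas, ns, alpha=0.5):
    """Sketch (ours) of the SIMULE-I biasing step:
    Sigma_tilde^(i) = alpha * Sigma^(i) + (1 - alpha) * sum_t (n_t / n_tot) * Sigma^(t)."""
    n_tot = float(sum(ns))
    pooled = sum(n_t * S_t for n_t, S_t in zip(ns, Sigmas)) / n_tot
    return [alpha * S + (1 - alpha) * pooled for S in Sigmas]
```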

Fan et al. (2014) have pointed out that the sample covariance matrix is inconsistent in high-dimensional settings. There is a large body of previous literature on covariance matrix estimation. Roughly, relevant studies can be grouped into three types: (a) Sparse covariance matrix estimation, including studies like hard-thresholding (Lam and Fan 2009), soft-thresholding (Tibshirani 1996), the smoothly clipped absolute deviation (SCAD, Fan et al. 2013), and minimax concave penalties (Zhang 2010). These methods, though simple, do not guarantee the positive definiteness of the estimated covariance matrix. (b) Positive definite sparse covariance matrix estimation that introduces penalty functions on the eigenvector space. Popular studies include Antoniadis and Fan (2011), Liu et al. (2014), Levina et al. (2008), Rothman (2012). (c) Factor-model based covariance matrix estimation like POET (Fan et al. 2013), which applies soft-thresholding to the residual space obtained after applying an approximate factor structure to the estimated covariance matrix. Most of these studies can be used to extend SIMULE. We leave a more thorough study of such combinations as future work.

4.3 Penalized log-likelihood for SIMULE

Comparing Eq. (9) of SIMULE with Eq. (13), previous multi-sGGM approaches mostly relied on penalized log-likelihood functions for learning. We have also considered extending penalized log-likelihood methods such as the “graphical lasso” estimator into our MTL setting with \(\varOmega ^{(i)} = \varOmega ^{(i)}_I + \varOmega _S\). However, it is hard to deal with the log-determinant term \(\log \det (\varOmega ^{(i)}_I + \varOmega _S)\) in the optimization. This is because \(\frac{\partial \log \det (\mathbf {X}+ \mathbf {Y})}{\partial \mathbf {X}} = (\mathbf {X}+ \mathbf {Y})^{-1}\) and it is difficult to calculate the inverse of \(\mathbf {X}+\mathbf {Y}\): there is no closed form for the inverse of the sum of two matrices, except in certain special cases. From the perspective of optimization, it is hard to directly use methods like coordinate descent for learning such a model due to this first-derivative issue. The authors of CSSL-GGM (Hara and Washio 2013) handled the issue by using the Alternating Direction Method of Multipliers (ADMM). This optimization is much more complicated than SIMULE’s. Another package, MNS (Monti et al. 2015), tackled this issue partly by adding latent variables. It assumes that each individual part \(\widetilde{\varvec{b}}^{(i),v} \sim N(0,\varPhi ^v)\), in which \(\varPhi ^v\) is a latent variable learned by the EM algorithm.

4.4 Relevant studies using \(\ell _1\) optimization with constraints

The formulation of an \(\ell _1\)-based objective with constraints has been explored for other tasks before. (a) For example, the Dantzig estimator (Candes and Tao 2007) uses the following constrained \(\ell _1\) optimization for high-dimensional linear regression:

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{\beta }} ||\beta ||_1 \quad \text {subject to: } ||\mathbf {X}^T\mathbf {y}-\mathbf {X}^T\mathbf {X}\beta ||_{\infty }\le \lambda . \end{aligned}$$

This is in contrast to the lasso estimator using the penalized log-likelihood formulation for the high-dimensional regression:

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{\beta }}||\mathbf {y}- \mathbf {X}\beta ||_2^2 + \lambda ||\beta ||_1 \end{aligned}$$

(b) For the case of single-task sGGM, the CLIME estimator (Cai et al. 2011) infers the precision matrix by solving Eq. (1), an \(\ell _1\) optimization with constraints (further reduced to a linear programming formulation). This has been shown to provide more favorable theoretical properties than penalized likelihood approaches like GLasso [solving Eq. (2)]. (c) To the best of our knowledge, the proposed SIMULE model is the first to use constrained \(\ell _1\) optimization for multi-task sGGM estimation. Table 2 summarizes relevant studies from the perspective of these two optimization formulations.

Table 2 Two categories of relevant studies: learning with “penalized log-likelihood” or learning with “\(\ell _1\) constrained optimization”

4.5 Optimization and computational concerns

Furthermore, a number of papers have focused on proposing ways to improve computational and storage performance when estimating sGGMs. For example, the BigQUIC algorithm (Hsieh et al. 2011) aims at an asymptotically quadratic optimization when estimating a single sGGM. The authors of the Node-based JGL method (Mohan et al. 2013) proposed a block-separable approach to improve the computational performance of multi-sGGMs. Our method can be extended to a parallel setting, since it naturally decomposes into per-column optimization problems. Node-based JGL (Mohan et al. 2013) has also used the ADMM optimization algorithm for learning. Compared to the linear-programming-based SIMULE, Mohan et al. (2013) has two main disadvantages: (a) a large number of complex optimization subproblems must be solved, and (b) the conditions for convergence are unspecified. SIMULE makes use of an efficient algorithm, is proven to achieve good numerical performance, and ensures convergence (see Sect. 5).

Furthermore, since imposing an \(\ell _1\) penalty on the model parameters formulates the structure learning of UGMs as a convex optimization problem, this strategy has shown successful results for modeling continuous data with GGMs or NGMs and discrete data with pairwise Markov Random Fields (MRFs) (Friedman et al. 2008; Höfling and Tibshirani 2009). However, the discrete case of pairwise MRFs is much harder because of the potentially intractable normalizing constant and also the possibility that each edge may have multiple parameters. One more complicating factor is that the pairwise assumption may need a few exceptions, for example when searching for higher-order combinatorial interactions as in recent studies like Schmidt and Murphy (2010) and Buchman et al. (2012). (Detailed descriptions of related work are not included due to space limitations.)

5 Theoretical analysis

5.1 Theoretical analysis for the basic SIMULE

This section proves the theoretical properties of the SIMULE estimator. To present our analysis more concisely, we reformulate our model in a more general form as follows:

$$\begin{aligned} \varSigma _{tot} := & {} \begin{pmatrix} \varSigma ^{(1)} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad \varSigma ^{(2)} &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \varSigma ^{(K)} \end{pmatrix} =(\sigma _{ij})_{Kp\times Kp}\\ \varOmega _I := & {} \begin{pmatrix} \varOmega _I^{(1)} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad \varOmega _I^{(2)} &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \varOmega _I^{(K)} \end{pmatrix}\\ \varOmega _S^{tot} := & {} \begin{pmatrix} \varOmega _S &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad \varOmega _S &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \varOmega _S \end{pmatrix}\\ \varOmega _{tot} := & {} \begin{pmatrix} (\varOmega _S+\varOmega _I^{(1)}) &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad (\varOmega _S+\varOmega _I^{(2)}) &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad (\varOmega _S+\varOmega _I^{(K)}) \end{pmatrix} = \varOmega _I + \varOmega _S^{tot} \\ X_{tot} := & {} \begin{pmatrix} X^{(1)} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad X^{(2)} &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad X^{(K)} \end{pmatrix}\\ I_K := & {} \begin{pmatrix} I &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad I &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad I \end{pmatrix} , \hbox { where } I \hbox { is a } p\times p \hbox { identity matrix}. \end{aligned}$$

We first introduce the following important Lemma.

Lemma 1

Let \(\varepsilon > 0\) and \(a,b \in \mathbb {R}\). If

$$\begin{aligned}&a,b = {\mathop {\hbox {argmin}}\limits _{x,y}} |x| + \varepsilon |y|\nonumber \\&\quad \text {Subject to: } x + y = E \end{aligned}$$
(14)

where E is a constant. Then \(ab \ge 0\).

Proof

If \(ab<0\), then both a and b are nonzero and \(|a+b| = \left| |a|-|b| \right| \). When \(\varepsilon \le 1\), let \(c = 0\) and \(d = a+b\); then \(|c| + \varepsilon |d| = \varepsilon \left| |a|-|b| \right| < \varepsilon (|a|+|b|) \le |a| + \varepsilon |b|\). When \(\varepsilon > 1\), let \(c = a+b\) and \(d = 0\); then \(|c| + \varepsilon |d| = \left| |a|-|b| \right| < |a|+|b| \le |a| + \varepsilon |b|\). In both cases (c, d) is feasible for Eq. (14) with \(cd = 0\) and a strictly smaller objective value, which contradicts the optimality of (a, b) in Eq. (14). \(\square \)

Corollary 1

Assume \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7), then

\(||(\varOmega _S + \varOmega _I^{(i)})||_1 = ||\varOmega _S||_1 + ||\varOmega _I^{(i)}||_1\)

Proof

By Lemma 1, if \(\widehat{\varOmega }_I^{(i)}\) and \(\widehat{\varOmega }_S\) are the optimal solution of Eq. (7), then \(\widehat{\varOmega }_{I,j,k}^{(i)}\widehat{\varOmega }_{S,j,k} \ge 0\) for every entry (j, k). Hence \(|\widehat{\varOmega }_{S,j,k} + \widehat{\varOmega }_{I,j,k}^{(i)}| = |\widehat{\varOmega }_{S,j,k}| + |\widehat{\varOmega }_{I,j,k}^{(i)}|\) for all (j, k), and summing over all entries gives the claim. \(\square \)

Now we can rewrite our model in Eq. (7) as:

$$\begin{aligned}&{\mathop {\hbox {argmin}}_{\varOmega _{tot},\varOmega _S^{tot}}} ||\varOmega _{tot}||_1 + (\varepsilon - 1)||\varOmega _S^{tot}||_1 \nonumber \\&\quad \; \text {Subject to:} \; ||(\varSigma _{tot}\varOmega _{tot} - I_K)||_{\infty } \le \lambda _n. \end{aligned}$$
(15)

This is because \(||\varOmega _{tot}||_1 + (\varepsilon - 1)||\varOmega _S^{tot}||_1 = \sum \nolimits _{i=1}^K ||(\varOmega _S + \varOmega _I^{(i)})||_1 + K(\varepsilon - 1)||\varOmega _S||_1\). By Corollary 1, \(||\varOmega _{tot}||_1 + (\varepsilon - 1)||\varOmega _S^{tot}||_1 = \sum \nolimits _i ||\varOmega ^{(i)}_I||_1+ \varepsilon K||\varOmega _S||_1\).

We use \(\varSigma _{tot}^0\) to represent the true value of \(\varSigma _{tot}\) and \(\widehat{\varSigma }_{tot}\) the estimated one. We also use \(\varOmega ^0_{tot} = (\omega _1^0,\omega _2^0,\ldots ,\omega _{Kp}^0)\) to denote the true \(\varOmega _{tot}\) and \(\widehat{\varOmega }^1_{tot} = (\hat{\omega }^1_{ij})\) to denote the solution of the optimization problem in  Eq. (15). The final solution is denoted \(\widehat{\varOmega }_{tot} := (\hat{\omega }_{ij}) = (\hat{\omega }_1,\hat{\omega }_2,\ldots ,\hat{\omega }_{Kp})\), where \(\hat{\omega }_{ij} = \hat{\omega }_{ji} = \hat{\omega }^{1}_{ij}\mathop {\mathrm {sign}}( \max (|\hat{\omega }^{1}_{ij}|- |\hat{\omega }^{1}_{ji}|,0)) + \hat{\omega }^{1}_{ji} \mathop {\mathrm {sign}}( \max (|\hat{\omega }^{1}_{ji}|- |\hat{\omega }^{1}_{ij}|, 0))\). Furthermore, we denote \({\mathbb E}[X^{(i)}]\) as \((\mu _1^{(i)},\mu _2^{(i)}, \ldots ,\mu _p^{(i)})^T\).

Notice that the constraint of  Eq. (15) is not related to \(\varOmega _S^{tot}\). Therefore \(\widehat{\varOmega }^1_{tot}\) is also the solution of the following optimization problem.

$$\begin{aligned} {\mathop {\hbox {argmin}}_{\varOmega _{tot}}} ||\varOmega _{tot}||_1 \quad \text {Subject to:} \; ||(\varSigma _{tot}\varOmega _{tot} - I_K)||_{\infty } \le \lambda _n. \end{aligned}$$
(16)

In the following discussion, we split the analysis into two cases based on two different moment conditions for each \(X^{(i)}\).

(C1) Exponential tails Suppose there exists a constant \(0< \eta < 0.25\), so that \( \frac{\log p}{n_i} \le \eta \) and

$$\begin{aligned} {\mathbb E}[e^{t(X_j - \mu _j)^2}] \le C < \infty , \text { } \forall \; |t| \le \eta ,\quad \forall j \in \{1,\ldots ,p\} \end{aligned}$$
(17)

where C is a constant.

(C2) Polynomial-type tails Suppose that for some \(\gamma \), \(c_1 > 0\), \(p \le c_1n_i^{\gamma }\), and for some \(\delta > 0\),

$$\begin{aligned} {\mathbb E}[|X_j - \mu _j|^{4\gamma + 4 + \delta }] \le C < \infty \; \; , \quad \forall j \in \{ 1,\ldots ,p\} \end{aligned}$$
(18)

We begin by assuming that the precision matrix \(\varOmega \) belongs to the uniformity class of matrices,

$$\begin{aligned} \mathcal {U} \,{:=}\, \mathcal {U}(q,s_0(p))&= \left\{ \varOmega : \varOmega \succ 0, ||\varOmega ||_\mathbf{1} \le M, \right. \nonumber \\&\quad \left. \max \limits _{1 \le i \le p}\sum \limits _{j=1}^p |\omega _{ij}|^q \le s_0(p)\right\} \end{aligned}$$
(19)

Here q is a constant with \(0 \le q < 1\). \(||\cdot ||_\mathbf{1}\) denotes the matrix 1-norm. M is a constant representing the upper bound. \(\varOmega := (\omega _{ij}) = (\omega _1,\ldots ,\omega _p)\). \(s_0(p)\) represents the sparsity level of \(\varOmega \) in the uniformity class. We would like to point out that \(s_0(p)\) depends on p, but there is no analytic form for the relationship between \(s_0(p)\) and p.

Lemma 2

Let \(\{\hat{\varOmega }_{tot}^1 \}\) be the solution set of  Eq. (7) and \(\{ \widehat{\mathbf {B}} \} := \{ (\widehat{\mathbf {B}}_I+ \widehat{\mathbf {B}}_S) \}\), where

$$\begin{aligned} \widehat{\mathbf {B}}_I= & {} \begin{pmatrix} \widehat{\mathbf {B}}_I^{(1)} &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad \widehat{\mathbf {B}}_I^{(2)} &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \widehat{\mathbf {B}}_I^{(K)} \end{pmatrix}\\ \widehat{\mathbf {B}}_S^0= & {} \begin{pmatrix} \widehat{\mathbf {B}}_S &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ 0 &{}\quad \widehat{\mathbf {B}}_S &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ 0 &{}\quad 0 &{}\quad \cdots &{}\quad \widehat{\mathbf {B}}_S \end{pmatrix} \end{aligned}$$

Here \( \widehat{B}_I^{(i)} = (\widehat{\beta }_1^{(i)},\widehat{\beta }_2^{(i)},\ldots , \widehat{\beta }_p^{(i)})\) and \(\widehat{\mathbf {B}}_S = (\widehat{\beta }_1^s,\widehat{\beta }_2^s,\ldots ,\widehat{\beta }_p^s)\). For \(j \in \{1,2,\ldots , p\}\), \((\widehat{\beta }_j^{(1)},\widehat{\beta }_j^{(2)},\ldots ,\widehat{\beta }_j^{(K)},\widehat{\beta }_j^s)\) is an optimal solution of Eq. (9) when working on the jth column. Therefore, \(\{ \widehat{\varOmega }_{tot}^1 \} =\{ \widehat{\mathbf {B}} \}\).

Lemma 3

If each \(\varOmega _I^{(i)} + \varOmega _S\) satisfies Condition Eq. (19), then \(\varOmega _{tot}\) also satisfies Condition Eq. (19).

Proof

Use the definition of \(\varOmega _{tot}\): it is block diagonal, and each diagonal block \(\varOmega _I^{(i)} + \varOmega _S\) satisfies Condition Eq. (19), so the matrix 1-norm bound, the row-sparsity bound and positive definiteness carry over to \(\varOmega _{tot}\). \(\square \)

Corollary 2

\(\widehat{\varOmega }_{tot}\) satisfies the condition \(\widehat{\varOmega }_{tot} \succ 0\), with a high probability.

Proof

Lemma 3 indicates that \(\varOmega _{tot}\) satisfies Condition Eq. (19). \(\square \)

Theorem 3

Suppose that \(\varOmega ^0_{tot} \in \mathcal {U}(q,s_0(p))\). If \(\lambda _n \ge ||\varOmega ^0_{tot}||_1(\max \nolimits _{ij}|\hat{\sigma }_{ij} - \sigma ^0_{ij}|)\), we have that

$$\begin{aligned}&\left| \left| \widehat{\varOmega }_{tot} - \varOmega ^0_{tot}\right| \right| _{\infty } \le 4 \left| \left| \varOmega ^0_{tot}\right| \right| _1\lambda _n, \end{aligned}$$
(20)
$$\begin{aligned}&\left| \left| \widehat{\varOmega }_{tot} - \varOmega ^0_{tot}\right| \right| _2 \le C_4 s_0(p)\lambda _n^{1-q}, \end{aligned}$$
(21)

and

$$\begin{aligned} \frac{1}{p}||\widehat{\varOmega }_{tot} - \varOmega ^0_{tot}||_F^2 \le C_5 s_0(p)\lambda _n^{2-q}, \end{aligned}$$
(22)

where constant \(C_4 \le 2(1+2^{1-q} + 3^{1-q})(4||\varOmega ^0_{tot}||_1)^{1-q}\) and constant \(C_5 \le 4 ||\varOmega ^0_{tot}||_1 C_4\).

Theorem 4

Suppose that \(\varOmega ^0_{tot} \in \mathcal {U}(q,s_0(p))\) and \(n_{tot}=\sum _{i=1}^K n_i\).

(a) Assume that condition (C1) holds. Let \(\lambda _n = C_0M\sqrt{\frac{\log (Kp)}{n_{tot}}}\), where constant \(C_0 = 2\eta ^{-2}(2 + \tau _0 + \eta ^{-1}e^2C^2)^2\) and constant \(\tau _0 > 0\). Then

$$\begin{aligned}&||\widehat{\varOmega }_{tot} - \varOmega ^0_{tot}||_2 \le C_1M^{2-2q}s_0(p)\left( \frac{\log (Kp)}{n_{tot}}\right) ^{(1-q)/2} \end{aligned}$$
(23)
$$\begin{aligned}&||\widehat{\varOmega }_{tot} - \varOmega ^0_{tot}||_{\infty } \le 4C_0M^2\sqrt{\frac{\log (Kp)}{n_{tot}}} \end{aligned}$$
(24)
$$\begin{aligned}&\frac{1}{p}||\widehat{\varOmega }_{tot} - \varOmega ^0_{tot}||_F^2 \le 4C_1M^{4-2q}s_0(p)\left( \frac{\log (Kp)}{n_{tot}}\right) ^{1-q/2} \end{aligned}$$
(25)

with a probability greater than \(1 - 4p^{-\tau _0}\), where constant \(C_1 \le 2(1+2^{1-q} + 3^{1-q})4^{1-q}C_0^{1-q}\).

(b) Assume that condition (C2) holds. Let \(\lambda _{n} = C_2M\sqrt{\frac{\log (Kp)}{n_{tot}}}\), where constant \(C_2 = \sqrt{(5+\tau _0)(\theta _0 + 1)}\) and

$$\begin{aligned} \theta _0 = \max \limits _{i,j,k} \hat{\varSigma }_{j,k}^{(i)}. \end{aligned}$$
(26)

Then

$$\begin{aligned}&||\widehat{\varOmega }_{tot} - \varOmega ^0_{tot}||_2 \le C_3M^{2-2q}s_0(p)\left( \frac{\log (Kp)}{n_{tot}}\right) ^{(1-q)/2} \end{aligned}$$
(27)
$$\begin{aligned}&||\widehat{\varOmega }_{tot} - \varOmega ^0_{tot}||_{\infty } \le 4C_2M^2\sqrt{\frac{\log (Kp)}{n_{tot}}} \end{aligned}$$
(28)
$$\begin{aligned}&\frac{1}{p}||\widehat{\varOmega }_{tot} - \varOmega ^0_{tot}||_F^2 \le 4C_3M^{4-2q}s_0(p)\left( \frac{\log (Kp)}{n_{tot}}\right) ^{1-q/2} \end{aligned}$$
(29)

with a probability greater than \(1- O(n_{tot}^{-\delta /8} + p^{-\tau _0/2})\), where constant \(C_3 \le 2(1+2^{1-q} + 3^{1-q})4^{1-q}C_2^{1-q}\).

Through Eqs. (23)–(29), we theoretically prove that we can achieve a good estimation of the target dependency graphs with the convergence rate \(O(\log (Kp)/n_{tot})\). Based on CLIME (Cai et al. 2011), the convergence rate of single-task sGGM is \(O(\log p/n_i)\), where \(n_{i}\) represents the number of samples of the ith task. Assuming \(n_i = \frac{n_{tot}}{K}\), the convergence rate of a single sGGM is \(O(K\log p/n_{tot})\). Clearly, since \(K\log p > \log (Kp)\), the convergence rate of SIMULE is better than that of single-task sGGM. This provides a theoretical proof of the benefit of multi-tasking sGGMs. Neither of these theoretical results has been investigated by previous studies.

All proofs of above theorems are provided in Sect. 7.

5.2 Theoretical analysis for NSIMULE estimator

In this subsection, we investigate the theoretical properties of the NSIMULE estimator and prove that its convergence rate is the same as that of SIMULE.

Theorem 5

Based on Liu et al. (2009), suppose we use the estimated Kendall’s tau correlation matrix \(\widehat{\mathbf {S}}\) to replace \(\widehat{\varSigma }\) in the parametric GLasso estimator. Then under the same conditions on \(\widehat{\varSigma }\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal estimator achieves the same (parametric) rate of convergence as the GLasso estimator for both precision matrix estimation and graph structure recovery.

Theorem 6

Suppose we use the estimated Kendall’s tau correlation matrices \(\hat{\mathbf {S}}^{(i)}\) in the nonparanormal SIMULE estimator. Then under the same conditions on \(\hat{\varSigma }^{(i)}\) (that ensure the consistency of the estimator under the Gaussian model), the nonparanormal SIMULE achieves the same rate of convergence as SIMULE (\(O(\sqrt{\log (Kp)/n_{tot}})\)) for both graph recovery and precision matrix estimation.

Proof

We can directly apply Theorem 5 (the main theorem of Liu et al. 2009). Our multi-task nonparanormal graph estimation will not change the convergence rate of SIMULE. \(\square \)

5.3 Potential non-identifiability issue

Linear programming is not strongly convex. Therefore, there may be multiple optimal solutions to the SIMULE formulation in Eq. (7) (i.e., an identifiability problem). In fact, the CLIME estimator (Cai et al. 2011) may also have multiple optimal solutions. Cai et al. (2011) have proved that all such solutions converge to the true one at an optimal convergence rate. Similarly, in Sect. 5.1, we have proved that the SIMULE formulation in Eq. (7) may result in multiple optimal solutions \(\{ \hat{\varOmega }_{tot} \}\), and that each of these solutions \(\hat{\varOmega }_{tot}\) converges to the true solution at an optimal convergence rate. We present Theorem 7 showing that for each optimal solution \(\hat{\varOmega }_{tot}\), when \(\varepsilon \ne 1\), we can obtain a unique decomposition into \(\varOmega _S\) and \(\{ \varOmega _I^{(i)} | i = 1,\ldots ,K\}\).

Theorem 7

When we pick \(\varepsilon > 0\) and \(\varepsilon \ne 1\), for each optimal solution \(\hat{\varOmega }_{tot}\) from Eq. (7), there exist unique \(\varOmega _S\) and \(\{ \varOmega _I^{(i)} | i = 1,\ldots ,K\}\) satisfying Eq. (5).

We provide the proof of Theorem 7 in Sect. 7.

In practice, we need to choose \(\varepsilon \) according to the application for which SIMULE is used. For example, for genome-related biomedical data, we can normally assume that the shared subgraph is denser than the individual interactions of each context. Therefore we pick \(\varepsilon < 1\) to reflect this assumption in our experiments in Sect. 6.

6 Experiments

In this section, we use four simulated datasets and two real-world biomedical datasets to evaluate the proposed estimators.

Table 3 AUC and partial AUC on simulated Gaussian datasets from Model 1
Table 4 AUC and partial AUC on simulated nonparanormal datasets from Model 1

6.1 Experimental settings

6.1.1 Baselines

We compare SIMULE, SIMULE-I and NSIMULE with the following baselines: (1) three multi-sGGM estimators, namely JGL-fused, JGL-group (Danaher et al. 2013) and SIMONE (Chiquet et al. 2011) (with the penalty functions described in Table 1); (2) the single-task CLIME baseline (i.e., each task uses CLIME independently); (3) the single-task nonparanormal CLIME (NCLIME) baseline (i.e., each task uses NCLIME independently); (4) the nonparanormal extensions of JGL-fused, JGL-group and SIMONE;Footnote 5 and (5) SIMONE-I (the “intertwined Lasso” in the SIMONE package), added as a baseline for comparison with SIMULE-I.Footnote 6

6.1.2 Metric

The edge-level false positive rate (FPR) and true positive rate (TPR) are used to measure the difference between the true graphs and the predicted graphs. By repeating each graph inference process 10 times, we obtain average metrics (also averaged over the K tasks) for each method we test. Here, FPR = \(\frac{\text {FP}}{\text {FP + TN}}\) and TPR = \(\frac{\text {TP}}{\text {TP + FN}}\), where, for a predicted graph, TP (true positives) and TN (true negatives) denote the number of correctly predicted nonzero entries and correctly predicted zero entries, respectively. Each FPR vs. TPR curve shows the multi-point performance of a method over a range of its regularization parameter. In Tables 3, 4, 5 and 6 we compare estimators using the area under the FPR–TPR curve (AUC) and the partial area under the curve (partial AUC) within the range \(FPR < q\) (e.g., \(q=20\%\) or \(q=5\%\)). For instance, choosing \(q=5\%\) focuses on the partial AUC of the region with very small FPRs. Higher AUC or partial AUC scores indicate better overall results.
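
To make the metric computation concrete, the following sketch computes the edge-level FPR and TPR by comparing the off-diagonal supports of an estimated and a true precision matrix, and approximates the (partial) AUC from a set of (FPR, TPR) points by the trapezoidal rule. The function names and the thresholding tolerance are illustrative assumptions, not part of any package used in this paper.

```python
import numpy as np

def edge_rates(omega_hat, omega_true, tol=1e-8):
    """Edge-level (FPR, TPR) from the off-diagonal supports of two matrices."""
    p = omega_true.shape[0]
    off_diag = ~np.eye(p, dtype=bool)
    pred = np.abs(omega_hat[off_diag]) > tol
    true = np.abs(omega_true[off_diag]) > tol
    fp = np.sum(pred & ~true)
    tn = np.sum(~pred & ~true)
    tp = np.sum(pred & true)
    fn = np.sum(~pred & true)
    return fp / (fp + tn), tp / (tp + fn)

def partial_auc(fprs, tprs, q=1.0):
    """Trapezoidal area under the FPR-TPR points restricted to FPR <= q."""
    order = np.argsort(fprs)
    f = np.asarray(fprs)[order]
    t = np.asarray(tprs)[order]
    keep = f <= q
    return np.trapz(t[keep], f[keep])
```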

Table 5 AUC and partial AUC on simulated Gaussian datasets from Model 2
Table 6 AUC and partial AUC on simulated nonparanormal datasets from Model 2

6.1.3 Selection of hyper-parameter \(\lambda _{n}\)

Recent studies by Negahban et al. (2009) and Yang et al. (2014) conclude that the regularization parameter \(\lambda \) of a single-task sGGM (with \(n_i\) samples) should satisfy \(\lambda \propto \sqrt{\frac{\log p}{n_i}}\). Combining this conclusion with our theoretical analysis in Sect. 5, we choose \(\lambda _{n} = \alpha \sqrt{\frac{\log (Kp)}{n_{tot}}}\), where \(\alpha \) is a hyper-parameter of SIMULE or NSIMULE to tune. In our experiments, \(\alpha \) is varied over the range \( \{0.05 \times i \mid i\in \{1,2,3,\ldots ,30 \}\}\). The Bayesian information criterion (BIC) is used whenever a specific value of the hyper-parameters needs to be selected.
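
As an illustration, the regularization grid described above can be written directly as follows (a minimal sketch; the variable names and example sizes are ours).

```python
import numpy as np

# Example sizes (ours): K tasks, p variables, n_tot total samples.
K, p, n_tot = 3, 100, 1500
alphas = [0.05 * i for i in range(1, 31)]                       # alpha grid from the text
lambda_grid = [a * np.sqrt(np.log(K * p) / n_tot) for a in alphas]
```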

We also tune the hyper-parameters of the baseline methods to obtain their FPR–TPR curves. If only one hyper-parameter needs tuning, we follow the same strategy as for SIMULE. For those baselines with two hyper-parameters (JGL-fused and JGL-group), given a certain \(\lambda _1\) (chosen in the same way as \(\lambda _n\)), we use the BIC criterion to select the best \(\lambda _2\) from the range \(\{0.05\times i \mid i\in \{1,2,3,\ldots ,20 \}\}\).

6.1.4 Selection of hyper-parameter \(\varepsilon \)

\(\varepsilon \) reflects the difference in sparsity between the shared subgraph and the context-specific subgraphs. Section 5.3 discusses our choice of \(\varepsilon \) for the two real-world datasets. For the simulated experiments, we select \(\varepsilon \) from the range \(\{0.1\times i \mid i\in \{1,2,\ldots ,9\}\}\).

6.2 Simulated Gaussian datasets

Using the following two graph models, we first generate two synthetic multivariate Gaussian datasets, each of which includes K tasks of data samples.

  • Model 1 Following Rothman et al. (2008), this model assumes \(\varOmega ^{(i)} = \mathbf {B}^{(i)}_I + \mathbf {B}_S + \delta ^{(i)}I\), where each off-diagonal entry in \(\mathbf {B}^{(i)}_I\) is generated independently and equals 0.5 with probability \(0.05i\) and 0 with probability \(1 - 0.05i\). Each off-diagonal entry of the shared part \(\mathbf {B}_S\) is generated independently and equals 0.5 with probability 0.1 and 0 with probability 0.9. \(\delta ^{(i)}\) is chosen large enough to guarantee the positive definiteness of the precision matrix. A clear shared structure \(\mathbf {B}_S\) therefore exists among the multiple graphs. We choose \(K \in \{2,3,4,5,6\}\) for this case.

  • Model 2 This model uses two graphs with special structures: the first task uses a grid graph and the second uses a ring graph. Such networks are popular in many real-world applications; for instance, certain biological pathways can be represented as rings. A clear shared structure exists between these two graphs. Clearly, \(K = 2\) for this case.

We choose \(p = 100\), i.e., the dimension of the feature variables is 100. Using either Model 1 or Model 2, we generate K blocks of data samples, with the ith block following \(N(0, (\varOmega ^{(i)})^{-1})\). In detail, for the ith task we randomly generate 500 simulated data samples from a multivariate Gaussian distribution with mean 0 and covariance matrix \((\varOmega ^{(i)})^{-1}\). We use the multivariate sampling method from stochastic simulation (Ripley 2009) to sample the simulated data blocks; in our implementation, we directly use the R function “mvrnorm” in the MASS package.

In summary, we first simulate precision matrices using Model 1 or Model 2. We then use the multivariate sampling method (Ripley 2009) to draw multivariate Gaussian data blocks with mean 0 and covariance matrix \((\varOmega ^{(i)})^{-1}\). This stochastic procedure generates simulated data blocks whose precision matrices follow the decomposition in Eq. (5). We then apply SIMULE, NSIMULE and the baseline models to these datasets to obtain the estimated dependency networks.
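
The following Python sketch illustrates this generation procedure for Model 1, using numpy.random in place of the R function “mvrnorm”; the symmetrization of the random entries and the exact choice of \(\delta ^{(i)}\) are our own assumptions about implementation details.

```python
import numpy as np

def model1_precisions(p=100, K=3, rng=None):
    """Omega^(i) = B_I^(i) + B_S + delta^(i) I, following Model 1."""
    rng = np.random.default_rng(0) if rng is None else rng

    def random_symmetric(prob):
        # off-diagonal entries are 0.5 with the given probability, then symmetrized
        B = 0.5 * (rng.random((p, p)) < prob)
        B = np.triu(B, k=1)
        return B + B.T

    B_S = random_symmetric(0.1)                       # shared part
    omegas = []
    for i in range(1, K + 1):
        B_I = random_symmetric(0.05 * i)              # individual part of task i
        M = B_I + B_S
        delta = max(0.0, -np.linalg.eigvalsh(M).min()) + 0.1   # ensure positive definiteness
        omegas.append(M + delta * np.eye(p))
    return omegas

def sample_blocks(omegas, n=500, rng=None):
    """Draw n Gaussian samples per task with covariance inv(Omega^(i))."""
    rng = np.random.default_rng(1) if rng is None else rng
    return [rng.multivariate_normal(np.zeros(o.shape[0]), np.linalg.inv(o), size=n)
            for o in omegas]

blocks = sample_blocks(model1_precisions())           # K data blocks, each 500 x p
```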

6.3 Simulated nonparanormal datasets

Using the same graphs from Model 1 and Model 2, we simulate two more sets of data samples that follow nonparanormal distributions in K different tasks. We pick \(K=3\) for Model 1 and \(K =2\) for Model 2. Starting from \(N(0, (\varPhi ^{(i)})^{-1})\) and applying the monotone transformation \(x \mapsto \mathrm{sign}(x)|x|^{\frac{1}{2}}\), each data sample is generated as a random vector Z such that \(\mathrm{sign}(Z)Z^2 = (\mathrm{sign}(\mathbf {z}_1)\mathbf {z}_1^2,\ldots ,\mathrm{sign}(\mathbf {z}_p)\mathbf {z}_p^2) \sim N(0, (\varPhi ^{(i)})^{-1})\). Thus Z follows a nonparanormal distribution. Using different monotone functions to generate the simulated data will not change the resulting correlation matrix \(\mathbf {S}\), since we use a rank-based nonparametric estimator to estimate the correlation matrix.
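
A minimal sketch of this transformation is given below: starting from a Gaussian data block Y, each entry is mapped through \(x \mapsto \mathrm{sign}(x)|x|^{1/2}\), so that applying \(z \mapsto \mathrm{sign}(z)z^2\) to the result recovers the original Gaussian samples.

```python
import numpy as np

def to_nonparanormal(Y):
    """Apply x -> sign(x) * |x|**0.5 elementwise to a Gaussian data block Y."""
    return np.sign(Y) * np.sqrt(np.abs(Y))

# Sanity check: sign(Z) * Z**2 recovers the original Gaussian samples.
Y = np.random.default_rng(0).standard_normal((500, 100))
Z = to_nonparanormal(Y)
assert np.allclose(np.sign(Z) * Z**2, Y)
```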

6.4 Experimental results on synthetic datasets

Using Tables 3, 4, 5 and 6, we compare all methods using (a) the area under the FPR–TPR curve (AUC), (b) the area under the FPR–TPR curve within the range \(FPR < 0.2\) (AUC-\(20\%\)), (c) the area under the FPR–TPR curve within the range \(FPR < 0.05\) (AUC-\(5\%\)), (d) the AUC scores when evaluating only the shared part (AUC-shared), (e) the AUC scores when evaluating only the individual parts of the multiple graphs (AUC-individual), and (f) the AUC scores when changing p (AUC-\(p=200\)). Tables 3 and 4 include results on datasets from graph Model 1; Tables 5 and 6 are about Model 2. Tables 3 and 5 show comparisons on the simulated Gaussian datasets; Tables 4 and 6 report results on the simulated nonparanormal datasets. Tables 4 and 6 also provide the results of the nonparanormal implementations of three baselinesFootnote 7—JGL-fused, JGL-group and SIMONE.

In summary, we can conclude from these tables that (1) SIMULE and NSIMULE obtain better AUC and partial AUC than the three multi-sGGM competitors, the nonparanormal extensions of these three competitors, and the two single-sUGM baselines, across Model 1 and Model 2 and across two different values of p. (2) NSIMULE produces better scores than SIMULE on nonparanormal data across both graph models (more apparent on Model 1); similarly, the nonparanormal extensions of the three multi-sGGM baselines outperform their Gaussian versions on the nonparanormal datasets. (3) All multi-task estimators perform better than the single-CLIME and single-NCLIME estimators. (4) On datasets from Model 2, the score differences among multi-task estimators are not as apparent as in the case of Model 1. (5) SIMULE and NSIMULE achieve better estimation of the individual subgraphs than the other baselines. (6) The “AUC-shared” columns show that JGL-fused and SIMONE estimate the shared parts better. This is expected, since their penalty functions explicitly enforce similarity among graphs. (7) Interestingly, in Tables 3 and 5 NSIMULE achieves similar or even better AUC scores on the Gaussian datasets. This is also expected: under a high-dimensional setting, the Kendall's tau correlation matrix used in NSIMULE has been proved to provide consistent estimation of the correlation matrix at an optimal convergence rate (Liu et al. 2009), whereas the sample covariance matrix used by SIMULE can be inconsistent in high-dimensional cases (Fan et al. 2014). As a future step, large covariance estimators such as POET (Fan et al. 2014) could replace the sample covariance in SIMULE. (8) Related to the previous point, SIMULE-I and SIMONE-I revise the context-specific sample covariance matrices with a bias towards a global empirical covariance matrix over all K tasks. The last two rows of Tables 3 and 5 clearly show that this simple strategy for covariance estimation provides a significant performance improvement over both SIMULE and SIMONE.

Fig. 3

FPR–TPR curves for different methods on the four simulated datasets. The upper two sub-figures are for the Gaussian simulated datasets; the lower two are for the nonparanormal datasets. The upper-left and lower-left graphs are generated using graph Model 1; the upper-right and lower-right use Model 2. Each curve is generated by varying the \(\lambda \) parameter(s). The curves from SIMULE and NSIMULE lie above those of all baseline multi-sGGM methods (more apparent on the two datasets generated by Model 1)

Figure 3 shows more detailed FPR–TPR comparisons among the joint sGGM estimators. Sub-figure (a) “Gaussian-Model1” clearly shows that our methods obtain better curves than the three multi-sGGM baselines. In sub-figure (b) “Gaussian-Model2”, the differences among multi-sGGM estimators are not as apparent as in “Gaussian-Model1”. Figure 3c, d (in the second row) show the FPR–TPR curves on the two nonparanormal datasets. Overall, the observations from these curves are consistent with those obtained from the four tables (Tables 3, 4, 5, 6).

Since the differences among multi-task estimators are not apparent on the datasets from Model 2, we further visualize the derived graph structures from all joint sGGM estimators using a black-white color-map visualization (i.e., black for nonzero entries and white for zero entries of the precision matrix) in Fig. 4. The left two columns of Fig. 4 present all color maps obtained on “Gaussian-Model 2”; the right two columns show all color maps from “nonparanormal-Model 2”. The first row represents the original grid graph and ring graph from Model 2. Since these two graph structures are clearly related, we can observe a diagonal band shared by the two color maps; the “Graph 1” grid graph has two additional off-diagonal narrow bands while the “Graph 2” ring graph does not. The color maps of the graphs predicted (with proper tuning parameters) by the NSIMULE and SIMULE methods are presented in the second and third rows. The three other multi-sGGM baselines are shown in the fourth, fifth and sixth rows. Instead of pushing the two graphs to be overly similar, SIMULE and NSIMULE achieve a better recovery of both the ring graph and the grid graph. In contrast, the three multi-task baseline methods all obtain an incorrect graph on the second task (ring graph) by pushing it to be too similar to the first task (grid graph). The single-task baselines (CLIME in the 8th row and NCLIME in the 7th row) obtain much worse graphs because they do not model the relationship between the two graphs at all.
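
A color map of this kind can be produced directly from the support of an estimated precision matrix; the sketch below (using matplotlib, with a function name of our own choosing) shows how the panels of Fig. 4 can be read and reproduced in principle.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_support(omega, title="", tol=1e-8):
    """Show the support of a precision matrix: black = nonzero entry, white = zero."""
    plt.imshow(np.abs(omega) > tol, cmap="gray_r", interpolation="nearest")
    plt.title(title)
    plt.xticks([])
    plt.yticks([])

# Example usage (omega_hat is any estimated precision matrix):
# plot_support(omega_hat, "Graph 1 (grid)"); plt.show()
```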

Fig. 4

On the two simulated datasets generated from Model 2, the estimated conditional-dependency graphs from all methods are compared using black-white color maps. In total we visualize two true graphs (grid and ring) and fourteen predicted dependency networks (by SIMULE, NSIMULE, three multi-sGGM baselines, CLIME and NCLIME). The left two columns are derived from the simulated Gaussian Model 2 dataset; the right two columns are derived from the simulated nonparanormal Model 2 dataset. We can conclude that the three multi-sGGM baseline methods obtain incorrect recoveries because they push the two tasks to be similar through their penalty functions. In contrast, SIMULE and NSIMULE [rows (2), (3)] achieve better edge predictions by modeling both joint and task-specific parameters

Fig. 5

a, c Changes of the sparsity level of the individual and shared subgraphs when varying the hyper-parameter \(\varepsilon \). e Changes of AUC when varying the hyper-parameter \(\varepsilon \). The results in (a), (c), (e) are obtained by running SIMULE on the Gaussian-Model 1 case. The upper-right sub-figure (b) plots the training time (in \(\log \)-seconds) versus p to compare the “column-per-core” parallel implementation of SIMULE with the original SIMULE implementation, JGL-fused, JGL-group, SIMONE and SIMONE-I. The right sub-figures (d, f) show AUC scores from SIMULE, three multi-sGGM baselines, SIMONE-I and SIMULE-I against varying p and varying K

One important hyper-parameter we need to pick for SIMULE and NSIMULE is \(\varepsilon \) in Eq. (7). This hyper-parameter reflects the sparsity level of the shared subgraph relative to the context-specific parts. The left two sub-figures (a) and (c) of Fig. 5 show the changes in sparsity level of both the individual and shared subgraphs across multiple values of \(\varepsilon \) (obtained by running SIMULE on the Gaussian-Model 1 case). Figure 5a is for \(K=3\) and Fig. 5c is for \(K= 6\). We can see that as \(\varepsilon \) increases, the sparsity level of the shared portion decreases while the average sparsity of the individual parts increases, for both values of K. This matches our analysis in Sect. 5.3. In real applications, \(\varepsilon \) indicates the difference in sparsity constraints we assume on the shared and individual parts; it should be chosen according to the domain knowledge of the specific application.

In addition, Fig. 5e investigates whether changes of \(\varepsilon \) influence the performance (AUC) of (N)SIMULE. On datasets from Model 1, Fig. 5e shows that the AUC scores of both SIMULE and NSIMULE exhibit very small variations across a large range of \(\varepsilon \).

Section 2.5 presents a parallel variation of SIMULE. On the simulated data of Gaussian-Model 1, we compare the training speed of the original SIMULE versus the parallel SIMULE in Fig. 5b (the upper-right sub-figure). For the parallel SIMULE, we run SIMULE by parallelizing “column per core” using 63 cores on a 64-core machine. The baselines, including the original SIMULE, JGL-fused, JGL-group, SIMONE and SIMONE-I, are restricted to a single core on the same machine.Footnote 8 Figure 5b reports the computational speed (training time in \(\log \)-seconds) across values of the dimension p. It clearly shows that the parallel SIMULE runs much faster than the single-core SIMULE implementation and the other baselines.Footnote 9
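
The “column-per-core” strategy can be sketched with Python's multiprocessing module. In the sketch below, solve_column is a hypothetical placeholder for the per-column linear program of the SIMULE formulation; it is included only to show how columns are distributed across cores, not as our actual implementation.

```python
import numpy as np
from multiprocessing import Pool

def solve_column(args):
    """Hypothetical placeholder for the per-column linear program of SIMULE.

    It returns a zero column so that the parallel skeleton below is runnable;
    a real implementation would solve the column-wise LP of Eq. (7) here.
    """
    j, Kp = args
    return j, np.zeros(Kp)

def parallel_columns(Kp, n_cores=63):
    """Distribute the Kp columns over n_cores worker processes ("column per core")."""
    with Pool(processes=n_cores) as pool:
        results = pool.map(solve_column, [(j, Kp) for j in range(Kp)])
    omega_tot = np.zeros((Kp, Kp))
    for j, col in results:
        omega_tot[:, j] = col
    return omega_tot

# Call parallel_columns(...) under `if __name__ == "__main__":` when the
# multiprocessing start method is "spawn" (e.g., on Windows or macOS).
```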

Furthermore, Fig. 5d, f provide the AUC scores of SIMULE, three multi-sGGM baselines, SIMONE-I and SIMULE-I against varying p and varying K. SIMULE-I provides a consistent performance improvement over SIMULE. SIMULE outperforms the three multi-sGGM baselines across multiple values of p and K. SIMULE-I outperforms SIMONE-I and runs smoothly for larger values of K. Unfortunately, Fig. 5f cannot provide the AUC scores of SIMONE-I for \(K=5\) and \(K=6\), because the SIMONE package could not converge for some \(\lambda \) values in these two cases.

6.5 Experiment results on real application I: identifying gene interactions using gene expression data across two cell contexts

Next, we apply SIMULE and the baselines to a real-world biomedical dataset: gene expression profiles describing many human samples across multiple cancer types (aggregated by McCall et al. 2011). Recent advancements in genome-wide monitoring have produced enormous amounts of data across most common cell contexts, such as multiple common cancer types (The Cancer Genome Atlas Research Network 2011). Complex diseases such as cancer are the result of multiple genetic and epigenetic factors. Thus, recent research has shifted towards the identification of multiple genes/proteins that interact directly or indirectly in contributing to certain disease(s). Structure learning of UGMs on such heterogeneous datasets can uncover statistical dependencies among genes and reveal how such dependencies vary from normal to abnormal states or across different diseases. These structural variations are highly likely to be contributing markers that influence or cause the diseases.

Two major cell contexts are selected from the human expression dataset provided by McCall et al. (2011): leukemia cells (895 samples) and normal blood cells (227 samples). We then choose the top 1000 features from the total of 12,704 features (ranked by variance) and apply SIMULE,Footnote 10 JGL-fused, JGL-group, SIMONE, single-CLIME, SIMONE-I and SIMULE-I to this two-task dataset. The derived dependency graphs are compared using the number of predicted edges that are validated by three major existing protein/gene interaction databases (Prasad et al. 2009; Orchard et al. 2013; Stark et al. 2006).Footnote 11 The number of matches between interactions in the databases and the edges predicted by each method is shown as a bar graph in the left sub-figure of Fig. 6. This bar graph clearly shows that SIMULE consistently outperforms the three multi-sGGM baselines and CLIME on both the individual and shared interactions from both cell contexts. Interestingly, SIMULE-I outperforms SIMONE-I and SIMULE, and SIMONE-I achieves a better recovery of known edges than JGL-group, JGL-fused, SIMONE and CLIME. This leads us to believe that exploring more choices of covariance matrix estimation to extend SIMULE is promising.
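
The database validation step amounts to counting how many predicted edges (treated as unordered gene pairs) appear in a reference set of known interactions. The sketch below illustrates this with hypothetical edge lists; the variable names and data format are ours and do not reflect the internal formats of the cited databases.

```python
def count_validated_edges(predicted_edges, known_interactions):
    """Count predicted edges (unordered gene pairs) found in a reference set."""
    known = {frozenset(pair) for pair in known_interactions}
    return sum(1 for edge in predicted_edges if frozenset(edge) in known)

# Hypothetical example:
predicted = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_E", "GENE_F")]
reference = [("GENE_B", "GENE_A"), ("GENE_C", "GENE_D")]
print(count_validated_edges(predicted, reference))  # -> 2
```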

Fig. 6

Validating the recovered molecular dependencies using existing bio-databases (Prasad et al. 2009; Orchard et al. 2013; Stark et al. 2006) of experimentally validated protein-protein interactions (PPI) or gene interactions in humans. The left sub-figure (a) is obtained from the multi-cell gene expression data. The right sub-figure (b) is obtained from the multi-cell TF-binding data. The “shared” bars for the baseline methods are obtained using the intersection of the predicted graphs. The number of matches between predicted edges and known interactions is shown as bars

6.6 Experiment results on real application II: identifying collaborations among TFs across multiple cell types

In molecular biology, the regulatory proteins that interact with one another to control gene transcription are known as transcription factors (TFs). TF proteins typically perform major cell regulatory functions (e.g., binding to DNA) by working together with other TFs. The collaboration patterns (e.g., conditional independence) among TFs normally vary across different cell contexts (e.g., cell lines), while a certain portion of the TF interactions is preserved across contexts. Understanding the collaboration networks among TFs is key to understanding cell development, including the defects that lead to different diseases. The ChIP-Seq datasets recently made available by the ENCODE project (ENCODE Project Consortium 2011) provide simultaneous binding measurements of TFs on thousands of gene targets. These measurements provide a “snapshot” of TF binding events across many cell contexts. The task of uncovering the functional dependencies among TFs thus connects to the task of discovering the statistical dependencies among TFs from their ChIP-Seq measurements.

Recently, two relevant papers (Cheng et al. 2011; Min et al. 2014) have discussed methods to infer co-association networks among TFs using ChIP-Seq data. Their approaches differ from ours because both projects have targeted a single cell type at a time. We select ChIP-Seq data for 27 TFs that are covered by ENCODE (ENCODE Project Consortium 2012) across three major human cell lines: (1) H1-hESC (embryonic stem cells: primary tissue), (2) GM12878 (B-lymphocyte: normal tissue) and (3) K562 (leukemia: cancerous tissue).

We apply SIMULE, SIMULE-I, JGL-group, JGL-fused, SIMONE and SIMONE-I to this multi-cell TF ChIP-Seq dataset. Comparisons of the different methods are performed using three major existing protein interaction databases (Prasad et al. 2009; Orchard et al. 2013; Stark et al. 2006). The numbers of matches between TF-TF interactions in the databases and those predicted by each method are plotted as a bar graph in Fig. 6b (the right sub-figure). The graph shows that SIMULE consistently outperforms JGL-fused, JGL-group and SIMONE on both the individual and shared interactions from all three cell types. SIMULE-I performs better than SIMONE-I and SIMULE. We further evaluated the resulting TF interactions using the popular “functional enrichment” analysis with DAVID (Da Wei Huang and Lempicki 2008) and found that SIMULE and SIMULE-I can reveal known functional sets and potentially novel interactions that drive leukemia. This leads us to believe that our approach can be used in a wider range of applications as well.

Many domain-specific studies have applied sGGM to real-world datasets, especially those from molecular biology or brain science. For example, Ma et al. (2007) estimate gene networks using microarray data in the model species Arabidopsis. Krumsiek et al. (2011) explore GGMs to reconstruct pathway reactions from high-throughput metabolomics data. Within brain science, several studies (Ng et al. 2013; Huang et al. 2010; Monti et al. 2015; Sun et al. 2009) have tried to learn the brain connectivity of Alzheimer's disease through sparse inverse covariance estimation. Due to space limits, we omit a detailed discussion of these studies.

7 Proof of Theorems

In this section, I(S) denotes the indicator function of the set S.

7.1 Lemma 2

Proof

In the following, we only show the case of \(1 \le j \le p\); when \(j > p\), the argument still holds by replacing \(e_j\) with \(e_{j \bmod p}\). \(||\varSigma _{tot}\hat{\varOmega }_{tot}^1 - I||_{\infty } \le \lambda _n \) is equivalent to \(||\varSigma _{tot}\hat{\omega }_j^1 - e_j||_{\infty } \le \lambda _n\) for \(1 \le j \le p\).

Thus we have

$$\begin{aligned} \left| \left| \hat{\omega }_j^1\right| \right| _1 \ge ||\hat{\beta }_j||_1 \text { } \; \forall 1\le j \le p. \end{aligned}$$
(30)

Also, \(||\varSigma _{tot}\widehat{\mathbf {B}} - I||_{\infty } \le \lambda _n\). So by the definition of \(\{\widehat{\varOmega }_{tot}^1 \}\), we have that

$$\begin{aligned} \left| \left| \widehat{\varOmega }_{tot}^1\right| \right| _1 \le ||\widehat{\mathbf {B}}||_1. \end{aligned}$$
(31)

With Inequality (30) and Inequality (31), we have that \(\widehat{\mathbf {B}} \in \{\widehat{\varOmega }_{tot}^1\}\). On the other hand, if \(\widehat{\varOmega }_{tot}^1 \notin \{ \widehat{\mathbf {B}} \}\), then there exists j such that \(||\hat{\omega }_j^1||_1 > ||\hat{\beta }_j + \hat{\beta }^s_j||_1\). Thus, by Inequality (30), we have \(||\widehat{\varOmega }_{tot}^1||_1 > ||\widehat{\mathbf {B}}||_1\), which contradicts Inequality (31). \(\square \)

7.2 Theorem 3

Proof

By the condition in Theorem 3,

$$\begin{aligned} ||\varSigma _{tot}^0 - \varSigma _{tot}||_{\infty } \le \frac{\lambda _n}{||\varOmega _{tot}^0||_1}. \end{aligned}$$
(32)

We then have

$$\begin{aligned}&\left| \left| I - \varSigma _{tot}\varOmega _{tot}^0\right| \right| _{\infty } = \left| \left| \left( \varSigma _{tot}^0 - \varSigma _{tot}\right) \varOmega _{tot}^0\right| \right| _{\infty } \nonumber \\&\quad \le \left| \left| \varOmega _{tot}^0\right| \right| _1\left| \left| \varSigma _{tot}^0 - \varSigma _{tot}\right| \right| _{\infty } \le \lambda _n. \end{aligned}$$
(33)

where we use the inequality \(||\mathbf {A}\mathbf {D}||_{\infty } \le ||\mathbf {A}||_{\infty }||\mathbf {D}||_1\) for any matrices \(\mathbf {A}\), \(\mathbf {D}\) of appropriate sizes.

Since \(\varOmega _{tot}^0\) also satisfies the constraint in Eq. (16) and \(\widehat{\varOmega }_{tot}^1\) is the optimal solution of Eq. (16), we have

$$\begin{aligned} \left| \left| \widehat{\varOmega }_{tot}^1 \right| \right| _1 \le \left| \left| \varOmega _{tot}^0\right| \right| _1. \end{aligned}$$
(34)

Hence, we have

$$\begin{aligned}&\left| \left| \varSigma _{tot}\left( \widehat{\varOmega }_{tot}^1 - \varOmega _{tot}^0\right) \right| \right| _{\infty } \nonumber \\&\quad \le \left| \left| \varSigma _{tot}\widehat{\varOmega }_{tot}^1 - I\right| \right| _{\infty } + \left| \left| I - \varSigma _{tot}\varOmega _{tot}^0\right| \right| _{\infty } \le 2\lambda _n. \end{aligned}$$
(35)

Therefore, based on Inequalities (32)–(35), we have

$$\begin{aligned}\left| \left| \varSigma _{tot}^0\left( \widehat{\varOmega }_{tot}^1 - \varOmega _{tot}^0\right) \right| \right| _{\infty }\le & {} \left| \left| \varSigma _{tot}\left( \widehat{\varOmega }_{tot}^1 - \varOmega _{tot}^0\right) \right| \right| _{\infty } + \left| \left| \left( \varSigma _{tot} - \varSigma _{tot}^0\right) \left( \widehat{\varOmega }_{tot}^1 - \varOmega _{tot}^0\right) \right| \right| _{\infty } \\\le & {} 2\lambda _n + \left| \left| \widehat{\varOmega }_{tot}^1 - \varOmega _{tot}^0\right| \right| _1\left| \left| \varSigma _{tot} - \varSigma _{tot}^0\right| \right| _{\infty } \le 4\lambda _n. \end{aligned}$$

Then,

$$\begin{aligned} \left| \left| \widehat{\varOmega }_{tot}^1 - \varOmega _{tot}^0\right| \right| _{\infty }\le \left| \left| \varOmega _{tot}^0\right| \right| _1\left| \left| \varSigma _{tot}^0\left( \widehat{\varOmega }_{tot}^1 - \varOmega _{tot}^0\right) \right| \right| _{\infty } \le 4\left| \left| \varOmega _{tot}^0\right| \right| _1\lambda _n. \end{aligned}$$

This proves Inequality (20) of Theorem 3.

To prove Inequality (21), let \(t_n = ||\widehat{\varOmega }_{tot} - \varOmega _{tot}^0||_{\infty }\) and define

$$\begin{aligned} \mathbf {h}_j= & {} \hat{\omega }_j - \omega ^0_j\\ \mathbf {h}_j^1= & {} (\hat{\omega }_{ij}I\{ |\hat{\omega }_{ij}|\ge 2t_n \}; 1\le i \le p)^T - \omega _j^0\\ \mathbf {h}_j^2= & {} \mathbf {h}_j - \mathbf {h}_j^1. \end{aligned}$$

By the definition of \(\widehat{\varOmega }_{tot}\), we have that \(||\hat{\omega }_j||_1\le ||\hat{\omega }^1_j||_1 \le ||\omega ^0_j||_1\). Then

$$\begin{aligned} \left| \left| \omega ^0_j\right| \right| _1 - \left| \left| \mathbf {h}^1_j\right| \right| _1 + \left| \left| \mathbf {h}^2_j\right| \right| _1 \le \left| \left| \omega ^0_j + \mathbf {h}^1_j\right| \right| _1 + \left| \left| \mathbf {h}^2_j\right| \right| _1 = \left| \left| \hat{\omega }_j\right| \right| _1 \le \left| \left| \omega _j^0\right| \right| _1, \end{aligned}$$

which implies that \(||\mathbf {h}^2_j||_1 \le ||\mathbf {h}^1_j||_1\). It follows that \(||\mathbf {h}_j||_1 \le 2||\mathbf {h}_j^1||_1\). Thus we only need to get the upper bound of \(||\mathbf {h}_j^1||_1\). We have that

$$\begin{aligned} \left| \left| \mathbf {h}_j^1\right| \right| _1&= \sum \limits ^p_{i=1}\left| \hat{\omega }_{ij}I\left\{ \left| \hat{\omega }_{ij}\right| \ge 2t_n \right\} - \omega _{ij}^0\right| \nonumber \\&\le \sum \limits _{i = 1}^p\left| \omega _{ij}^0I\left\{ \left| \omega _{ij}^0\right| \le 2t_n \right\} \right| +\sum \limits _{i=1}^p\left| \hat{\omega }_{ij}I\left\{ \left| \hat{\omega }_{ij}\right| \ge 2t_n \right\} - \omega _{ij}^0I\left\{ \left| \omega _{ij}^0\right| \le 2t_n \right\} \right| \nonumber \\&\le (2t_n)^{1-q}s_0(p) + t_n\sum \limits _{i = 1}^p I\left\{ \left| \hat{\omega }_{ij}\right| \ge 2t_n\right\} \nonumber \\&\quad + \sum \limits _{i=1}^p\left| \omega _{ij}^0\right| \left| I\left\{ \left| \hat{\omega }_{ij}\right| \ge 2t_n \right\} - \omega _{ij}^0 - I\left\{ \left| \omega _{ij}^0\right| \le 2t_n \right\} \right| \nonumber \\&\le 2(t_n)^{1-q}s_0(p) + t_n\sum \limits _{i = 1}^{p}I\left\{ \left| \omega _{ij}^0\right| \ge t_n \right\} \nonumber \\&\quad +\sum \limits _{i = 1}^p\left| \omega _{ij}^0\right| I\left\{ \left| \left| w_{ij}^0\right| - 2t_n\right| \le \left| \hat{\omega }_{ij} - \omega _{ij}^0\right| \right\} \nonumber \\&\le (2t_n)^{1-q}s_0(p) + (t_n)^{1-q}s_0(p) + (3t_n)^{1-q}s_0(p)\nonumber \\&\le (1 + 2^{1-q}+3^{1-q})t_n^{1-q}s_0(p). \end{aligned}$$
(36)

This proves Inequality (21) of Theorem 3. Here we have used the elementary fact that, for any \(a,b,c\in {\mathbb R}\), \(|I\{ a<c\} - I\{ b<c\}|\le I\{ |b-c|<|a-b|\}\).

Finally, Inequality (22) can be derived from Inequality (20), Inequality (36) and the inequality \(||\mathbf {A}||^2_{F} \le p||\mathbf {A}||_1||\mathbf {A}||_{\infty }\). \(\square \)

7.3 Theorem 4

Proof

Theorem 4(a)

At the beginning of Sect. 5, we defined \(\varSigma _{tot} :=(\sigma _{ij})_{Kp\times Kp} \). By Theorem 3, we only need to prove

$$\begin{aligned} \max \limits _{ij}|\hat{\sigma }_{ij} - \sigma ^0_{ij}|\le C_0\sqrt{\log Kp/n_{tot}} \end{aligned}$$
(37)

with probability greater than \(1-4p^{-\tau _0}\) under (C1). Without loss of generality, we assume that \({\mathbb E}X^{(i)} = 0\). Let \(\hat{\varSigma }_{tot}^0 := n_{tot}^{-1}\sum \limits ^{n_{tot}}_{k=1}{\mathbf {X}_{tot}}_k{\mathbf {X}_{tot}}_k^T\) and \(\mathbf {Y}_{kij} = {\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj} - {\mathbb E}{\mathbf {X}_{tot}}_{ki}{\mathbf {X}_{tot}}_{kj}\). We then have \(\hat{\varSigma }_{tot} = \hat{\varSigma }_{tot}^0 -\bar{\mathbf {X}}_{tot}\bar{\mathbf {X}}_{tot}^T\). Let \(t = \eta \sqrt{\log Kp/n_{tot}}\).

Now we use the inequality \(|e^s - 1 - s|< s^2e^{\max (s,0)}\) for any \(s \in {\mathbb R}\) and let

$$\begin{aligned} C_{K1} = 2 + \tau _0 + \eta ^{-1}C^2. \end{aligned}$$
(38)

With a few steps of basic calculation, we can get

$$\begin{aligned}&{\mathbb P}\left( \sum \limits _{k=1}^{n_{tot}} \mathbf {Y}_{kij} \ge \eta ^{-1}C_{K1}\sqrt{n_{tot}\log Kp}\right) \\&\quad \le e^{-C_{K1}\log Kp}\left( {\mathbb E}\exp (tY_{kij})\right) ^{n_{tot}}\\&\quad \le \exp \left( -C_{K1}\log Kp + n_{tot}t^2{\mathbb E}Y_{kij}^2e^{t|Y_{kij}|}\right) \\&\quad \le \exp (-C_{K1}\log Kp + \eta ^{-1}C^2\log Kp)\\&\quad \le \exp (-(\tau _0 + 2)\log Kp). \end{aligned}$$

Thus, we have

$$\begin{aligned} {\mathbb P}\left( \left| \left| \hat{\varSigma }_{tot}^0 - \varSigma _{tot}^0\right| \right| _{\infty } \ge \eta ^{-1}C_{K1}\sqrt{\log Kp / n_{tot}}\right) \le 2p^{-\tau _0}. \end{aligned}$$
(39)

Then using the simple inequality \(e^s \le {e}^{s^2+1}\) for \(s>0\), we have

$$\begin{aligned} {\mathbb E}e^{t|X_j|}\le eC,\quad \forall t \le \eta ^{\frac{1}{2}}. \end{aligned}$$
(40)

Let

$$\begin{aligned} C_{K2} = 2 +\tau _0 + \eta ^{-1}e^2C^2 \end{aligned}$$
(41)

and \(a_n = C_{K2}^2(\log Kp /n_{tot})^{\frac{1}{2}}\). As before, we can show that

$$\begin{aligned}&{\mathbb P}(||\bar{\mathbf {X}}\bar{\mathbf {X}}^T||_{\infty } \ge \eta ^{-2} a_n\sqrt{\log Kp/n_{tot}})\nonumber \\&\quad \le p \max \limits _i{\mathbb P}\left( \sum \limits _{k=1}^{n_{tot}}\mathbf {X}_{ki} \ge \eta ^{-1}C_{K2}\sqrt{n_{tot}\log Kp}\right) \nonumber \\&\quad \quad + p\max \limits _i{\mathbb P}\left( -\sum \limits _{k =1}^{n_{tot}}\mathbf {X}_{ki} \ge \eta ^{-1}C_{K2}\sqrt{n_{tot}\log Kp}\right) \nonumber \\&\quad \le 2p^{-\tau _0-1}. \end{aligned}$$
(42)

By Inequality (39), Inequality (42), and choosing \(C_0 > \eta ^{-1} C_{K1} + \eta ^{-2}a_n\), we can prove that Inequality (37) holds. \(\square \)

Proof

Theorem 4(b)

Let \(\bar{\mathbf {Y}}_{kij} = \mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\} - {\mathbb E}\mathbf {X}_{ki}\mathbf {X}_{kj}I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}|\le \sqrt{n_{tot}/(\log Kp)^3}\}\), \(\check{\mathbf {Y}}_{kij} = \mathbf {Y}_{kij} - \bar{\mathbf {Y}}_{kij}\).

Since \(b_n := \max _{i,j}{\mathbb E}|\mathbf {X}_{ki}\mathbf {X}_{kj}|I\{|\mathbf {X}_{ki}\mathbf {X}_{kj}| > \sqrt{n_{tot}/(\log Kp)^3}\} = O(1)n_{tot}^{-\gamma - \frac{1}{2}}\), we have, by condition (C2),

$$\begin{aligned}&{\mathbb P}\left( \max _{i,j}\left| \sum \limits _{k=1}^{n_{tot}} \check{\mathbf {Y}}_{ijk}\right| \ge 2n_{tot}b_n\right) \\&\quad \le {\mathbb P}\left( \max _{i,j}\left| \sum \limits _{k=1}^{n_{tot}} \mathbf {X}_{ki}\mathbf {X}_{kj}I\left\{ |\mathbf {X}_{ki}\mathbf {X}_{kj}| > \sqrt{n_{tot}/(\log Kp)^3}\right\} \right| \ge n_{tot}b_n\right) \\&\quad \le {\mathbb P}\left( \max _{i,j}\sum \limits _{k=1}^{n_{tot}} |\mathbf {X}_{ki}\mathbf {X}_{kj}|I\left\{ \mathbf {X}_{ki}^2 + \mathbf {X}_{kj}^2 \ge 2\sqrt{n_{tot}/(\log Kp)^3} \right\} \ge n_{tot}b_n\right) \\&\quad \le {\mathbb P}\left( \max _{i,j}\mathbf {X}^2_{ki} \ge \sqrt{n_{tot}/(\log Kp)^3}\right) \\&\quad \le pn_{tot}{\mathbb P}\left( \mathbf {X}_{11}^2 \ge \sqrt{n_{tot}/(\log Kp)^3}\right) \\&\quad =O(1)n_{tot}^{-\frac{\delta }{8}}. \end{aligned}$$

By Bernstein’s inequality (cf. Bennett 1962) and some elementary calculations,

$$\begin{aligned}&{\mathbb P}\left( \max _{i,j}\left| \sum \limits _{k=1}^{n_{tot}}\bar{\mathbf {Y}}_{kij}\right| \ge \sqrt{(\theta _0 + 1)(4+\tau _0)n_{tot}\log Kp}\right) \\&\quad \le p^2\max _{i,j}{\mathbb P}\left( \left| \sum \limits _{k=1}^{n_{tot}} \bar{\mathbf {Y}}_{kij}\right| \ge \sqrt{(\theta _0 + 1)(4 + \tau _0)n_{tot}\log Kp}\right) \\&\quad \le 2p^2\max _{i,j}\exp \left( \frac{-(\theta _0 + 1)(4 + \tau _0)n_{tot}\log Kp}{2n_{tot}{\mathbb E}\bar{\mathbf {Y}}_{1ij}^2 + \sqrt{(\theta _0 + 1)(64+16\tau _0)}\,n_{tot}/(3\log Kp)}\right) \\&\quad =O(1)p^{-\tau _0/2}. \end{aligned}$$

Therefore, we have

$$\begin{aligned} \mathbb {P}\Big (\left| \left| \hat{\varSigma }_{tot}^0 - \varSigma _{tot}^0\right| \right| _{\infty }\ge & {} \sqrt{(\theta _0+1)(4+\tau _0)\log Kp/n_{tot}}+2b_n\Big )\nonumber \\= & {} O\left( n_{tot}^{-\delta /8} + p^{-\tau _0/2}\right) . \end{aligned}$$
(43)

By using the same truncation argument and Bernstein’s inequality, we can show that \({\mathbb P}(\max \nolimits _i|\sum \nolimits _{k=1}^{n_{tot}}\mathbf {X}_{ki}|\ge \sqrt{\max \nolimits _i \sigma ^0_{ii}(4+\tau _0)n_{tot}\log Kp}) = O(n_{tot}^{-\delta /8} + p^{-\tau _0/2})\).

Therefore,

$$\begin{aligned} {\mathbb P}\left( ||\bar{\mathbf {X}}\bar{\mathbf {X}}^T||_{\infty } \ge \max \limits _i \sigma _{ii}^0(4+\tau _0)\log Kp/n_{tot}\right) = O\left( n_{tot}^{-\delta /8} + p^{-\tau _0/2}\right) \end{aligned}$$
(44)

Combining Eq. (43) and Eq. (44), we have

$$\begin{aligned} \max \limits _{ij}|\hat{\sigma }_{ij} - \sigma ^0_{ij}| \le \sqrt{(\theta _0+1)(5+\tau _0)\log Kp/n_{tot}} \end{aligned}$$
(45)

with probability greater than \(1-O(n_{tot}^{-\delta /8} + p^{-\tau _0/2})\). The proof is completed by combining Inequality (45) and Theorem 3. \(\square \)

7.4 Theorem 7

Proof

Assume that there exist two different optimal solutions \((\hat{\varOmega }^{(i)}_I, \hat{\varOmega }_S)\) and \((\hat{\varOmega }^{(i)\prime }_I, \hat{\varOmega }_S')\) that satisfy Eq. (7) and \(\hat{\varOmega }^{(i)}_I + \hat{\varOmega }_S = \hat{\varOmega }^{(i)\prime }_I + \hat{\varOmega }_S' \). Since they are optimal solutions, by Lemma 1 we have \(\hat{\varOmega }^{(i)}_{I,j,k}\hat{\varOmega }_{S,j,k} \ge 0\) and \(\hat{\varOmega }^{(i)\prime }_{I,j,k}\hat{\varOmega }_{S,j,k}' \ge 0\). Therefore, \(|\hat{\varOmega }^{(i)}_{I,j,k}| + |\hat{\varOmega }_{S,j,k}| = |\hat{\varOmega }^{(i)}_{I,j,k} + \hat{\varOmega }_{S,j,k}|\) and \(|\hat{\varOmega }^{(i)\prime }_{I,j,k}| + |\hat{\varOmega }_{S,j,k}'| = |\hat{\varOmega }^{(i)\prime }_{I,j,k} + \hat{\varOmega }_{S,j,k}'|\). Hence,

$$\begin{aligned} ||\varOmega ^{(i)}||_1 = \left| \left| \varOmega ^{(i)}_I\right| \right| _1 + ||\varOmega _S||_1 \end{aligned}$$
(46)

We denote \(\hat{\varOmega }^{(i)} = \hat{\varOmega }^{(i)}_I + \hat{\varOmega }_S\) and \(\hat{\varOmega }^{(i)\prime } = \hat{\varOmega }^{(i)\prime }_I + \hat{\varOmega }_S'\). Combining Eq. (7), Eq. (5) and Eq. (46), \((\hat{\varOmega }^{(i)}, \hat{\varOmega }_S)\) and \((\hat{\varOmega }^{(i)\prime }, \hat{\varOmega }_S')\) are both optimal solutions of

$$\begin{aligned}&{\mathop {\hbox {argmin}}\limits _{\varOmega ^{(i)},\varOmega _S}} \sum \limits _i ||\varOmega ^{(i)}||_1 + K(\varepsilon - 1) ||\varOmega _S||_1\nonumber \\&\quad \text {Subject to: }||\varSigma ^{(i)}\varOmega ^{(i)} - I||_{\infty } \le \lambda _n,\quad i=1,\ldots , K \end{aligned}$$
(47)

The constraints do not involve \(\varOmega _S\), and we have \(\hat{\varOmega }^{(i)}_{I,j,k} + \hat{\varOmega }_{S,j,k} = \hat{\varOmega }^{(i)}_{j,k} = \hat{\varOmega }^{(i)\prime }_{j,k} = \hat{\varOmega }^{(i)\prime }_{I,j,k} + \hat{\varOmega }_{S,j,k}'\). By Lemma 1, \(|\hat{\varOmega }^{(i)}_{I,j,k}| + |\hat{\varOmega }_{S,j,k}| = |\hat{\varOmega }^{(i)\prime }_{I,j,k}| + |\hat{\varOmega }_{S,j,k}'| = |\hat{\varOmega }^{(i)}_{j,k}|\). Therefore, when \(\varepsilon < 1\), the optimization in Eq. (47) reduces to the following:

$$\begin{aligned}&\mathop {\mathrm {argmax}}\limits _{\varOmega _S} ||\varOmega _S||_1\nonumber \\&\quad \text {Subject to: } \left| \varOmega _{S,j,k}\right| \le \left| \hat{\varOmega }^{(i)}_{j,k}\right| , \varOmega _{S,j,k}\hat{\varOmega }^{(i)}_{j,k} \ge 0 \end{aligned}$$
(48)

Equation (48) gives the unique solution \(\hat{\varOmega }_{S,j,k} = \min \nolimits _i(|\hat{\varOmega }_{j,k}^{(i)}|) \mathop {\mathrm {sign}}({\mathop {\hbox {argmin}}\nolimits _{\hat{\varOmega }_{j,k}^{(i)}}} |\hat{\varOmega }_{j,k}^{(i)}|)\). This uniqueness contradicts the assumption that \(\hat{\varOmega }_S \ne \hat{\varOmega }_S'\). For \(\varepsilon > 1\), we can similarly prove that there exist unique \(\varOmega _S\) and \(\{ \varOmega _I^{(i)} | i = 1,\ldots ,K\}\). \(\square \)
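
To make the closed-form shared part above concrete, the following sketch recovers \(\hat{\varOmega }_S\) element-wise from K estimated precision matrices (case \(\varepsilon < 1\)) and sets \(\hat{\varOmega }^{(i)}_I = \hat{\varOmega }^{(i)} - \hat{\varOmega }_S\); this is an illustrative reading of the formula above, not code from our implementation.

```python
import numpy as np

def split_shared_individual(omegas):
    """Given K estimated precision matrices, return (Omega_S, [Omega_I^(i)]).

    Omega_S[j, k] is the entry of minimum magnitude across the K matrices
    (keeping its sign), following the unique solution of Eq. (48) for eps < 1;
    the individual parts are the remainders.
    """
    stack = np.stack(omegas)                      # shape (K, p, p)
    idx = np.argmin(np.abs(stack), axis=0)        # which task attains the min |entry|
    omega_s = np.take_along_axis(stack, idx[None, ...], axis=0)[0]
    omega_i = [omega - omega_s for omega in omegas]
    return omega_s, omega_i
```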

8 Conclusions

This paper introduces SIMULE, a novel method for learning shared and distinct patterns simultaneously when inferring multiple sGGMs or sNGMs jointly. Thanks to its \(\ell _1\)-constrained formulation, our solution is efficient and can be parallelized. We successfully apply SIMULE to four synthetic datasets and two real-world datasets. We prove that SIMULE enjoys a favorable convergence rate and theoretically justify the benefit of multi-task learning. Future work will extend SIMULE to model more complex relationships among contexts.