A constrained L1 minimization approach for estimating multiple Sparse Gaussian or Nonparanormal Graphical Models

The flood of multi-context measurement data from many scientific domains have created an urgent need to reconstruct context-specific variable networks, that can significantly simplify network-driven studies. Computationally, this problem can be formulated as jointly estimating multiple different, but related, sparse Undirected Graphical Models (UGM) from samples aggregated across several tasks. Previous joint-UGM studies could not fully address the challenge since they mostly focus on Gaussian Graphical Models (GGM) and have used likelihood-based formulations to push multiple estimated networks toward a common pattern. Differently, we propose a novel approach, SIMULE (learning Shared and Individual parts of MULtiple graphs Explicitly) to solve multi-task UGM using a l1 constrained optimization. SIMULE can handle both multivariate Gaussian and multivariate Nonparanormal data (greatly relaxing the normality assumption most real data do not follow). SIMULE is cast as independent subproblems of linear programming that can be solved efficiently. It automatically infers specific dependencies that are unique to each context as well as shared substructures preserved among all the contexts. Theoretically we prove that SIMULE achieves a consistent estimation at rate O(log(Kp)/ntot) (not been proved before). On four synthetic datasets and two real datasets, SIMULE shows significant improvements over state-of-the-art multi-sGGM and single-UGM baselines.


Introduction
In this paper, we consider the problem of jointly estimating K undirected graphical models (UGM) from K multivariate data blocks.Each data block contains a different set of data observations, on the same set of variables.This is motivated by the fact that the past decade has seen a revolution in collecting large-scale heterogeneous data from many scientific fields like genetics or brain science.Given such data, understanding and quantifying variability of variable graphs across multiple contexts is a fundamental analysis task.This is because quantifying and analyzing interaction edges (representing statistical dependencies) between molecules, that are activated only under a specific context, can predict or help to understand the existence or severity of such a context.In order to address the above issue, we propose a novel approach that uses a 1 constrained minimization formulation for joint structure learning of multiple sparse UGMs.We name the method SIMULE (learning Shared and Individual parts of MULtiple graphs Explicitly).Using a 1 constrained optimization strategy (Section 2), SIMULE extends CLIME [2] to a multi-tasking setting.The learning step is solved efficiently through a formulation of multiple independent sub-problems of linear programming for which we also provide a parallel version of the learning algorithm.Compared with previous multi-task sGGM models, SIMULE can accurately quantify task-specific network variations that are unique to each task.This also leads to a better generalization and benefits all the involved tasks.We provide evaluations through both experimental results and theoretical analysis.Theoretically we prove that SIMULE and its nonparanormal extension NSIMULE achieve a consistent estimation of the target (true) dependency graphs with a high probability at rate O(log(Kp)/n tot ).Here n tot represents the total number of samples from all tasks, p means number of features and K describes the number of tasks.Experimentally, we show significant improvement in performance of SIMULE and NSIMULE over multiple baseline methods on four synthetic datasets.The proposed methods obtain better AUC and partial AUC scores on all simulated cases.
2 Method SIMULE: Infer Shared and Individual Parts of Mulltiple sGGM Explicitly: Treating sparse GGM estimation from each data block as a single task, our main task is to learn multiple sGGMs over K data blocks or K tasks jointly.This can lead to better generalization and benefit all of the involved tasks (theoretically proved in Section 2.2).First we model each GGM network as: where Ω S is the shared pattern among all graphs and Ω (i) I represents the individual part specific for the i-th graph.Second, we choose the CLIME [2] estimator to model each sGGM task and we sum up multiple CLIME estimators from each task for multi-tasking.This gives us the following novel formulation of SIMULE: Subject to: . ., K In Section 2.2, we theoretically prove that the estimated Ω (i) will be positive definite and converge to the true precision matrices with a high probability.
Eq. ( 1) is equivalent to a linear programming problem.To solve it, we follow the primal dual interior method [1] that has also been used in the Dantzig selector for regression [3].Similar to CLIME estimator, we can solve Eq. ( 1) column by column without influencing the resulting solution.Method Variation: nonparanormal SIMULE (NSIMULE) Though sGGM is powerful, its normality assumption is commonly violated in real applications such as in in the case of biomedical data we analyze in Section 3. The nonparanormal and transelliptical Graphical Models recently proposed by [11] have extended sGGM to new distribution families.Therefore we extend SIMULE to a novel variation, nonparanormal SIMULE (NSIMULE) that learns to fit multiple nonparanormal graphical models (NGM) [12] jointly.The nonparanormal graphical model (NGM) [12] assumes that data samples follow a multivariate nonparanormal distribution, which is a strict superset of Gaussian distribution.Its modeling and optimization is same as the Gaussian version of SIMULE(Eq.( 1)).

Related Work
Most previous methods that estimate multiple sGGMs jointly follow the penalized log-likelihood formulation.They differ from each other by using different penalty functions for pushing the multiple graphs towards a common pattern.We choose three most relevant studies that also provide runnable software as our baselines in the experiments: (1) Joint graphical lasso (JGL) [7]; (2) Node-perturbed JGL [15]; and (3) Simone [5].JGL used the fused graphical lasso and group graphical lasso to penalize the difference between each of the two graphs.Node-based JGL proposed a novel penalty, RCON or "row-column overlap norm" for capturing special relationship among graphs.Simone tried three different penalties including such as 2 norm of each graph.These studies do not explicitly model individual parts of the graphs.Furthermore, [9] proposed to estimate a population graph from multi-block data using median-graph idea.It is conceptually similar to the shared part modeled by SIMULE.However, they do not explicitly model individual parts that are specific to each task.Another recent study CSSL-GGM [10] also tried to model both the shared and individual substructures in multi-sGGMs.Different from ours, their formulation follows the penalized likelihood framework and uses 1,p norm to regularize the task-specific parts.The 1,p norm they use actually pushes the individual parts of multiple graphs to be similar which is contradictory to the original purpose of these parameters.More recently [16] proposed a method to learn population and subject-specific brain connectivity networks via a so-called Mixed Neighborhood Selection (MSN) method.Since MSN is specially designed for brain imaging data, it assumes each individual graph is generated through random effects modeling by latent variables.While MSN may fit brain imaging data better, our model is more general

Theoretical Analysis
We theoretically prove that we can achieve a good estimation of target dependency graphs with the convergence rate O(log(Kp)/n tot ) (details are not shown due to space limitation).We also prove that the nonparametric variation NSIMULE will not change this convergence rate.
Based on CLIME, the convergence rate of single-task sGGM is O(log p/n i ).Here n i represents the number of samples of i-th task.Assuming n i = ntot K , the convergence rate of single sGGM is O(K log p/n tot ).Clearly, since K log p > log(Kp), the convergence rate of SIMULE is better than single-task sGGM.This provides theoretical proofs for the benefit of multi-tasking sGGM.Both of these theoretical results haven not been investigated by the previous studies.Simulated Data: We use four simulated datasets to evaluate the proposed methods.Using two network models 1 , we first generate two synthetic multivariate Gaussian datasets and two nonparanormal datasets, in which each model includes K tasks of data samples (We choose K = 2 for simplicity).We compare our model with the baseline methods mentioned in Section 2.1 plus single CLIME and single-task Nonparanormal CLIME.The edge-level false positive rate (FPR) and true positive rate (TPR) are used to measure the difference between the true graphs and the predicted graphs.The FPR vs. TPR curves generated by different methods are provided in Figure 1.The first row of Figure 1 presents FPR vs. TPR plots from SIMULE, NSIMULE and the baseline methods on two simulated Gaussian datasets from Model 1 and Model 2. The subfigure "Gaussian-Model1" clearly shows that our methods obtain better under-the-plot areas than three multi-sGGM baselines and two single-sGGM baselines.We can also conclude all multi-task estimators perform better than single-CLIME and single-NCLIME estimators.On the subfigure "Gaussian-Model2", all multi-task estimators perform better than single-CLIME and single-NCLIME estimators.The differences among various multi-sGGM estimators are less apparent in the subfigure "Gaussian-Model2".

Experiment
Real Data I: Furthermore, we compare the proposed methods and the baselines on one real-world data: gene expression profiles for multiple human samples across different cancer types (aggregated by [13]).Complex diseases such as cancers are a result of multiple genetic and epigenetic factors.Recent research, therefore, has shifted toward the identification of multiple genes/proteins that interact directly or indirectly in contributing to certain disease(s).Structural variations discovered by UGMs on such heterogeneous datasets are highly likely to be contributing factors that influence or cause the diseases.
We first select two major cell contexts from the human expression dataset [13]: leukemia cells (including 895 sample) and normal blood cells (including 227 samples).We choose top 1000 features from the total 12,704 features (according to their ranks by variance).Next, we apply NSIMULE (more robust to non-Gaussian), JGL-fused, JGL-group, Simone and single-NCLIME on this two-task dataset.On the derived dependency graphs, we perform comparisons by using the number of predicted edges that are validated by three major existing protein/gene interaction databases [18,17,19].The numbers of matches between interactions in databases and edges predicted by each method have been plotted as a bar graph in Figure 2 (shown as the right subfigure).This bar graph clearly shows that our proposed method consistently outperforms baselines on both individual and shared interactions from the two cell contexts.
2: Validate the found molecule dependencies using existing bio-databases [18,17,19] of experimentally validated known protein-protein interactions (PPI) or gene interactions in human.The left graph is obtained from the TF-binding data.The right graph is obtained from the gene expression data.The "shared" bars from the baseline methods are obtained using the intersection among the predicted graphs.The number of matches among predicted edges and known interactions is shown as bar lines.These bar graphs clearly show NSIMULE finds more matches than baseline methods.
Real Data II: In molecular biology, the regulatory proteins that interact with one another to control gene transcription are known as transcription factors (TFs).TF proteins typically perform major cell regulatory functions (e.g., binding on DNA) by working together with other TFs.The collaboration patterns (e.g., conditional independence) among TFs normally vary across different cell contexts (e.g., cell lines).Meanwhile, a certain portion of the TF interactions are preserved across multiple cell contexts.Understanding the inter-relational networks among the TFs is key in understanding cell development, including defects, which lead to different diseases.The ChIP-Seq datasets recently made available by the ENCODE project [6] provide simultaneous binding measurements of TFs to thousands of gene targets.These measurements provide a "snapshot" of TF binding events within various cell contexts.Thus, the task of uncovering the functional dependencies among TFs connects to the task of discovering the statistical dependencies among TFs [4,14] from their ChIP-Seq measurements.We selected ChIP-Seq data for 27 TFs that are common for three major cell lines in humans (1) H1-hESC (embryonic stem cells : primary tissue), ( 2) GM12878 (B-lymphocyte : normal tissue) and (3) K562 (leukemia : cancerous tissue).The experimental setting is same as for Real Data I.The numbers of matches between TF-TF interactions in databases and those predicted by each method have been plotted as a bar graph shown in the left subfigure of Figure 2. The graph shows clearly that our proposed method consistently outperforms others on both individual and shared interactions from all three cell types.We further evaluated the resulting TF interactions using the popular "functional enrichment" analysis through DAVID tools [8].The enrichment analysis table ( not included due to space limit) clearly shows that the NSIMULE method generates more favorable enrichment than the baseline methods.In summary, the graphs generated by the proposed method provide a simple yet informative way to describe the collaborative interactions among TFs across multiple cell contexts.This leads us to believe that our approach can be used in a wider range of applications as well.

Figure 1 :
Figure 1: The FPR-TPR curve graph for different methods on four simulated datasets.The upper two are for Gaussian simulated datasets.The lower two are for nonparanormal datasets.The upper left and lower left graphs are generated using graph Model 1.The upper right and lower right are generated using Model 2. Each curve is generated through varying the tuning parameter(s).We can see that curves from SIMULE and NSIMULE are above all baseline methods (more apparent on two datasets generated by Model 1).