BRANE Cut: biologically-related a priori network enhancement with graph cuts for gene regulatory network inference
Abstract
Background
Inferring gene networks from high-throughput data constitutes an important step in the discovery of relevant regulatory relationships in organism cells. Despite the large number of available Gene Regulatory Network inference methods, the problem remains challenging: the underdetermination in the space of possible solutions requires additional constraints that incorporate a priori information on gene interactions.
Methods
Weighting all possible pairwise gene relationships by a probability of edge presence, we formulate the regulatory network inference as a discrete variational problem on graphs. We enforce biologically plausible coupling between groups and types of genes by minimizing an edge labeling functional coding for a priori structures. The optimization is carried out with Graph cuts, an approach popular in image processing and computer vision. We compare the inferred regulatory networks to results achieved by the mutual-information-based Context Likelihood of Relatedness (CLR) method and by the state-of-the-art GENIE3, winner of the DREAM4 multifactorial challenge.
Results
Our BRANE Cut approach infers the five DREAM4 in silico networks more accurately (with improvements from 6 % to 11 %). On a real Escherichia coli compendium, an improvement of 11.8 % compared to CLR and 3 % compared to GENIE3 is obtained in terms of Area Under Precision-Recall curve. Up to 48 additional verified interactions are obtained over GENIE3 for a given precision. On this dataset involving 4345 genes, our method achieves a performance similar to that of GENIE3, while being more than seven times faster. The BRANE Cut code is available at: http://www-syscom.univ-mlv.fr/~pirayre/Codes-GRN-BRANE-cut.html.
Conclusions
BRANE Cut is a weighted graph thresholding method. Using biologically sound penalties and data-driven parameters, it improves three state-of-the-art GRN inference methods. It is applicable as a generic network inference post-processing, owing to its computational efficiency.
Keywords
Network inference, Reverse engineering, Discrete optimization, Graph cuts, Gene expression data, DREAM challenge
Background
Gene expression microarray techniques and high-throughput sequencing-based experiments furnish numerical data for gene regulatory process characterization. Gene Regulatory Network (GRN) inference provides a framework to transform high-throughput data into meaningful information. It consists of the construction of graph structures that highlight regulatory links between transcription factors and their target genes. GRNs are used as an initial step for experimental condition analysis or network interpretation, for instance classification tasks [1], leading to more insightful biological knowledge extraction. They may also directly offer genetic targets for specific experiments, such as directed mutagenesis and/or knockout procedures.
Despite the large variety of proposed GRN inference methods, building a GRN remains a challenging task due to the nature of gene expression and the structure of the experimental data. It notably involves data dimensionality, especially in terms of gene/replicate/condition proportions. Indeed, gene expression data obtained from microarrays or high-throughput technologies correspond to the expression profiles of thousands of genes. Expression profiles reflect gene expression levels for different replicates or strains studied in different physicochemical, temporal or culture medium conditions. Although the cost of biological experiments keeps decreasing, gene expression data is often acquired under a limited number of replicates and conditions compared to the number of genes. This causes difficulties in properly inferring gene regulatory networks and in recovering reliable biological interpretations of such networks. Continuous efforts from the bioinformatics community, partly driven by the organization of the DREAM challenges [2], have hitherto allowed for constant progress in GRN inference efficiency.
GRN inference approaches are often split into two classes of methods [3, 4]: model-based or information-theoretic score-based. The latter notably employs mutual-information measures, which quantify the mutual dependence or the information shared by stochastic phenomena. They are used in frequently mentioned and compared GRN methods, for instance: Relevance Network (RN) [5], Algorithm for the Reconstruction of Accurate Cellular Network (ARACNE) [6], Minimum Redundancy NETwork (MRNET) [7], or Context Likelihood of Relatedness (CLR) [8]. CLR was shown to outperform RN, ARACNE and MRNET on several datasets [8]. While RN removes edges whose mutual information is lower than a threshold, CLR exhibits improved performance by computing a score derived from Z-statistics on mutual information, leading to more robust results. Model-based methods include Bayesian approaches, Gaussian graphical models [9, 10], or differential equations [4, 11]. Graphical models rely on strong hypotheses on the data distribution, which may yield poor performance when tested on real datasets where the number of replicates or conditions is very small in proportion to the number of genes. The performance of such inferred networks can be noticeably improved by a network deconvolution approach ([12], thereafter denoted by ND) that removes global transitive or indirect effects by eigendecompositions. Differential equation approaches are often restricted by limited-size time-course data. The more recent GENIE3 (GEne Network Inference with Ensemble of trees) [13] and Jump3 [14] approaches avoid such a pitfall by making no assumptions on the data. Instead, they formulate the graph inference as a feature selection problem, and learn a ranking of edge presence probability. A drawback of model-based versus mutual-information-based approaches is a rather high computational cost on standard-size networks.
The problem of network inference boils down to finding a set of edges that (hopefully) represents actual regulations between genes, given their expression profiles. As we search for a set of regulatory edges, the outcome can be encoded as a binary integer solution: presence or absence of each gene-to-gene edge. From this framework, we incorporate additional structural a priori based on biological observations and assumptions. They control different connectivity aspects involving particular genes coding for transcription factors. Such supplementary information from heterogeneous data sources, when available, supports the network inference process [15]. We then translate the network inference problem into a variational formulation, as detailed in the ‘Mathematical modeling of the structural a priori’ section. Our approach generalizes classical inference. A first additional penalty influences the degree of connectivity of transcription factors and target genes. A second constraint promotes edges related to coregulation mechanisms. The obtained integer programming problem may be solved by finding a maximal flow in a graph, as explained in the ‘Optimization strategy’ section. This approach, known as Graph cuts, is well investigated in the computer vision and image processing literature, where it has demonstrated computational efficiency in a large number of tasks [16].
The main contributions of this work are the following:
 1. We introduce BRANE Cut, a novel Biologically-Related A priori Network Enhancement for gene regulation based on Graph cuts. Previous Graph cuts formulations in bioinformatics were employed only for clustering in biological network analysis [17] or for feature selection in the Genome-Wide Association Study context [18].
 2. The proposed method generalizes standard regulatory network inference by incorporating additional terms with a biological interpretation. Since their regularization parameters are estimated from gene set cardinality, it can be applied to various transcriptomic data.
 3. The computation time of our method is negligible in comparison with other model-based approaches, while improving inference quality.
 4. It can be used as a generic GRN post-processing with any input weights and supplementary information on transcription factors.
The paper is organized as follows: we propose in the next section the novel variational approach for building GRNs and we detail the efficient optimization strategy used to solve the related minimization problem. BRANE Cut outcomes and performance on benchmark datasets coming from the DREAM4 challenge and the Escherichia coli compendium are provided in the ‘Results and discussion’ section. We finally conclude and offer perspectives.
Methods
Mathematical modeling of the structural a priori
We first introduce our notation before detailing our structural models and variational formulation. Let G represent the total number of genes for which expression data is collected. Expression data is gathered in a symmetric weighted adjacency matrix \(\mathbf {W} \in \mathbb {R}^{G\times G}\). Its (i,j) element corresponds to a statistical measure reflecting the strength of the link, or information shared, between the expression profiles of gene i ∈ {1,…,G} and gene j ∈ {1,…,G}. Our approach uses nonnegative weights. A convenient choice for ω_{i,j} is the normalized mutual information (ω_{i,j} ∈ [0,1]) computed between the expression profiles of genes i and j.
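As an illustration, such a weight matrix can be estimated with a simple histogram-based mutual-information estimator. The sketch below is our own illustration, not the exact estimator used in the paper; the function names and the `bins` parameter are assumptions. It builds a symmetric matrix rescaled into [0, 1], matching the normalization assumed for ω_{i,j}:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of the mutual information between two expression profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y
    nz = p_xy > 0                           # avoid log(0)
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

def weight_matrix(expr):
    """expr: (G, n_conditions) array of expression profiles.
    Returns a symmetric, zero-diagonal W with entries in [0, 1]."""
    G = expr.shape[0]
    W = np.zeros((G, G))
    for i in range(G):
        for j in range(i + 1, G):
            W[i, j] = W[j, i] = mutual_information(expr[i], expr[j])
    if W.max() > 0:
        W /= W.max()   # rescale so that omega_{i,j} lies in [0, 1]
    return W
```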
Let us comment on the first term in the above functional. In order to select edges of strong weights ω_{i,j}, the first term acts as a biological data fidelity term. It represents a gene-to-gene edge deletion cost. Thus, if ω_{i,j} is large (respectively, small), its edge deletion cost is high (respectively, low), disfavoring (respectively, favoring) its deletion. We now explore the last two penalty terms of (2), corresponding to our biologically-related structural a priori regularization.
Hence, \(\overline{\text{TF}}\)–\(\overline{\text{TF}}\) edges have weights assigned to 2η. The parameter λ_{TF} (respectively, \({\lambda _{\overline {\text {TF}}}}\)) acts in the neighborhood of TF genes (respectively, \({\overline {\text {TF}}}\) genes). They may be interpreted as two threshold parameters. This double threshold promotes grouping between strong and weaker edges among functionally-related genes. A similar approach is used in image segmentation [21] to enhance object detection with reduced sensitivity to irrelevant features [22]. To promote TF–\(\overline{\text{TF}}\) interactions, the λ_{TF} parameter should be greater than \({\lambda _{\overline {\text {TF}}}}\). To ensure that any TF-involved interaction is selected first, we should verify that \(\eta \geq {\lambda _{\text {TF}}} \geq {\lambda _{\overline {\text {TF}}}}\). Additionally, removing all \(\overline{\text{TF}}\)–\(\overline{\text{TF}}\) edges amounts to setting their corresponding x_{i,j} to zero. Consequently, η should exceed the maximum value of the weights ω. Since we address different data types and input weight distributions, we can easily renormalize them all to ω_{i,j} ∈ [0,1], and choose η=1. When \({\lambda _{\text {TF}}}={\lambda _{\overline {\text {TF}}}}\), no distinction is made on the type of edges. This is equivalent to using a unique threshold value, as in classical gene network thresholding. This can be interpreted as if, without further a priori, all genes were indistinguishable from putative TFs. However, different λ_{TF} and \({\lambda _{\overline {\text {TF}}}}\) may be beneficial. We indeed show in the Additional file 1 that for any fixed value of λ_{TF}, smaller values of \({\lambda _{\overline {\text {TF}}}}\) improve graph inference results.
A simple linear dependence \({\lambda _{\text {TF}}} = \beta {\lambda _{\overline {\text {TF}}}}\), with β ≥ 1, suffices to define a generalized inference formulation encompassing the classical one. We fixed β as a parameter based on the gene-to-TF cardinality ratio: \(\beta = \frac {\mathcal {V}}{\mathcal {T}}\). This choice is consistent when no a priori is formulated on the TFs (i.e. all genes are considered as putative TFs): in that case \(\mathcal {T} = \mathcal {V}\), hence β=1 and \({\lambda _{\text {TF}}}={\lambda _{\overline {\text {TF}}}}\). As mentioned above, without knowledge on TFs, we recover classical gene network thresholding. The λ_{i,j} parameter now only depends on a single free parameter \({\lambda _{\overline {\text {TF}}}}\), similarly to the large majority of inference methods requiring a final thresholding step on their weights.
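The data-driven coupling above can be sketched in a few lines (the function name is ours, not the paper's):

```python
def coupled_thresholds(n_genes, n_tf, lam_tfbar):
    """lambda_TF = beta * lambda_TFbar with beta = |V| / |T| (gene/TF cardinality ratio).
    When every gene is a putative TF (n_tf == n_genes), beta = 1 and the scheme
    reduces to a single-threshold cut, as in classical network thresholding."""
    beta = n_genes / n_tf
    return beta * lam_tfbar, lam_tfbar
```

With, say, 100 genes and 20 putative TFs, β = 5, so the threshold acting around TFs is five times the free parameter \({\lambda _{\overline {\text {TF}}}}\).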
where 𝟙 is the characteristic function (equal to 1 when the condition in argument is satisfied and 0 otherwise) and μ ≥ 0 is a regularization parameter controlling the impact of the third term on the global cost. The proposed numerator counts the number of \({\overline {\text {TF}}}\) genes coregulated by j and j^{′}. As we exclude the gene i, the maximal number of \({\overline {\text {TF}}}\) genes coregulated by j and j^{′} equals \(|\mathcal {V} \setminus \mathcal {T}| - 1\). Hence, using the latter quantity as the denominator casts the \(\rho _{i,j,j^{\prime }} / \mu \) parameter as a coregulation probability relative to couples of TFs (j,j^{′}). The greater the coregulation probability, the stronger the influence of the third term. This penalty requires that at least two \({\overline {\text {TF}}}\) target genes exist (hence, the denominator does not vanish). Otherwise, when \(\mathcal {T} = \mathcal {V}\), we set μ=0.
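The coregulation weight can be sketched as follows. This is our illustration only: the paper's exact indicator condition is not reproduced here, so we assume (hypothetically) that a non-TF gene k ≠ i counts as coregulated by TFs j and j′ when both weights w_{j,k} and w_{j′,k} exceed a threshold `thresh`:

```python
import numpy as np

def coreg_weight(i, j, jp, W, is_tf, mu, thresh=0.5):
    """Sketch of the coregulation penalty weight rho_{i,j,j'}.

    Assumption (ours, not the paper's exact indicator): gene k != i, k not a TF,
    counts as coregulated by TFs j and j' when both W[j, k] and W[jp, k]
    exceed `thresh`. The count is divided by |V \\ T| - 1, so rho / mu behaves
    as a coregulation probability for the TF pair (j, j')."""
    G = W.shape[0]
    n_coreg = sum(
        1 for k in range(G)
        if k != i and not is_tf[k] and W[j, k] > thresh and W[jp, k] > thresh
    )
    denom = (G - int(np.sum(is_tf))) - 1      # |V \ T| - 1
    return mu * n_coreg / denom if denom > 0 else 0.0
```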
We now turn our attention to the strategy for computing an optimal labeling vector x^{∗} solution to Problem (2).
Optimization strategy
By using elements from graph theory, we now explain how a maximum flow algorithm can solve Problem (2). It relies on the maximum flow/minimum cut duality [24]: the computation of an optimal edge labeling minimizing (2) can be performed by maximizing a flow in a network \(\mathcal {G}_{f}\).
A flow (or transportation) network \(\mathcal {G}_{f}\) is a directed, weighted graph including two specific nodes: the source s (a node with zero in-degree) and the sink t (a node with zero out-degree). We recall that the degree of a node is defined as the number of edges incident to that node.
We now introduce the concept of flow in the transportation network \(\mathcal {G}_{f}\). A flow function f assigns a real value to each edge under two main constraints: the capacity limit and the flow conservation. The capacity limit constraint entails that the flow in each edge has to be less than or equal to the capacity (i.e. the weight) of this edge. If the flow equals the capacity, the edge is said to be saturated. The flow conservation constraint signifies that, at each node, the entering flow equals the exiting flow. Subject to these two constraints, the aim is to find the maximal flow from s to t in the flow network \(\mathcal {G}_{f}\). According to the graph construction rules provided by [25], the flow network for solving Problem (2) is composed of:

A set of ε nodes a_{ i,j } with (i,j)∈{1,…,G}^{2}, i<j, linked to the source s with edges of weight ω_{ i,j }. Each node is associated with a label x_{ i,j }.

A set of G nodes v_{k} with labels y_{k}, k ∈ {1,…,G}. The node v_{k} is linked to the previously defined node a_{i,j} if k=i or k=j. If such an edge exists, a weight λ_{k} is assigned to it. In reference to (3), the weight λ_{k} equals η, λ_{TF} or \({\lambda _{\overline {\text {TF}}}}\), according to the nature of the node a_{i,j} (corresponding to the edge e_{i,j} in the initial network \(\mathcal {G}\)).

A set of q edges, linking nodes a_{i,j} to \(a_{i,j^{\prime }}\) for which the regulator coupling property is satisfied, with weights equal to \(\rho _{i,j,j^{\prime }}\).

An additional set of G edges, linking nodes v_{ i }, i∈{1,…,G} to the sink node t.
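Any maximum-flow solver can then be run on this construction. The paper relies on the Boykov–Kolmogorov C++ code [26]; the stdlib-only Edmonds–Karp sketch below (our illustration, not the paper's implementation) shows the max-flow/min-cut duality on a generic capacity graph: after convergence, the nodes still reachable from s in the residual graph form the source side of the minimum cut, from which the edge labeling is read:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph.
    Returns (flow_value, residual). The minimum cut is read off the residual."""
    res = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u in list(capacity):
        for v in capacity[u]:
            res.setdefault(v, {}).setdefault(u, 0)   # reverse residual arcs
    res.setdefault(t, {})
    flow = 0
    while True:
        parent = {s: None}                           # BFS for a shortest augmenting path
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                          # no augmenting path left
            return flow, res
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:                            # augment along the path
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck

def source_side(res, s):
    """Nodes reachable from s in the residual graph (source side of the min cut)."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v, c in res.get(u, {}).items():
            if c > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen
```

In the BRANE Cut construction, the a_{i,j} and v_{k} nodes would populate `capacity` with the ω, λ and ρ weights listed above.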
Problem dimension reduction
This formula also provides a better insight into the role played by the thresholding parameters λ_{i,j}. We now have a fast optimization strategy to generate a solution to the proposed variational formulation. One of the advantages of employing the BRANE Cut algorithm is the optimality guarantee of the resulting inferred network with respect to the proposed criterion. We next describe the quantitative gains that can be achieved using BRANE Cut.
Results and discussion
We compare the BRANE Cut approach to the top-performing graph inference methods on synthetic and real data. The considered state-of-the-art methods are CLR, which outperforms ARACNE and Relevance Networks on the E. coli dataset, and GENIE3, winner of the DREAM4 multifactorial challenge [27] on synthetic data among a large number of competing methods, and also outperforming other approaches on the real E. coli dataset. For a fair evaluation, all networks are inferred using the same set of parameters for a given method: CLR results are computed with the ‘plos’ method and the default values for the two quantization parameters. GENIE3 outcomes are obtained using the Random Forest method and \(K = \sqrt {\mathcal {T}}\). We also post-processed both the CLR and GENIE3 weights with ND (network deconvolution [12]), and applied BRANE Cut on both the deconvolved ND-CLR and ND-GENIE3 networks.
Validation datasets
The DREAM4 dataset
The fourth Dialogue for Reverse Engineering Assessments and Methods (DREAM4) [27] multifactorial challenge provides five simulated datasets based on real network topologies from the prokaryote E. coli and the eukaryote S. cerevisiae, together with simulated expression data. At the time of the challenge, the competing approaches did not have access to a list of putative transcription factors, which is now available online. As this list is a requirement of our method, we benchmark the best-performing network inference methods using this additional information. The networks are composed of 100 genes, with a total of 100 expression levels per gene. The evaluation of the inferred networks was performed using the gold standard provided in the DREAM4 multifactorial challenge.
The Escherichia coli dataset
This dataset was first introduced in [8] and is composed of 4345 gene expression profiles, each profile containing 445 gene expression levels. This compendium contains steady-state and time-course expression profiles. RegulonDB [28] is the primary database on transcriptional regulation in Escherichia coli K-12, containing manually curated knowledge from original scientific publications. As in [8], we used version 3.9 to evaluate inferred networks. This database offers a set of 1211 genes for which 3216 regulatory interactions are confirmed.
The DREAM5 dataset
The fifth Dialogue for Reverse Engineering Assessments and Methods (DREAM5) challenge [2] provides four networks. The first one contains an in silico dataset while the three others correspond to real datasets. For the four networks, the list of putative transcription factors is known. In this work, we used the three networks (1, 3 and 4) for which a ground truth is provided. The first network is composed of 1643 genes (195 TFs) and expression data in 805 conditions. Network 3 contains information about 4511 genes (334 TFs) in 805 conditions, while network 4 compiles 5950 genes (333 TFs) and 536 conditions. The evaluation of the inferred networks was performed using the gold standard provided in the DREAM5 challenge.
Evaluation measures
where TP is the number of true positives, FP the number of false positives and FN the number of false negatives. The Precision value indicates the proportion of correctly inferred edges among the total number of inferred edges. The Recall value reveals the proportion of correctly inferred edges with respect to the total number of expected edges given by the gold standard. In order to evaluate and rank the different tested methods, Precision-Recall (PR) curves are commonly used [8]. As the best results correspond to both high precision and high recall values, the Area Under the Precision-Recall Curve (AUPR) is an appropriate quantitative criterion to measure the quality of an inference method (higher is better).
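These measures can be sketched by sweeping a decision threshold down a ranked edge list (our illustration; starting the curve at precision 1 before the first prediction is a convention we assume, not a detail stated in the paper):

```python
def pr_curve(ranked_edges, gold):
    """Precision and Recall while sweeping a threshold down a ranked edge list.
    ranked_edges: edges sorted by decreasing score; gold: set of true edges."""
    tp = 0
    precision, recall = [], []
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in gold:
            tp += 1                     # TP: inferred edge confirmed by the gold standard
        precision.append(tp / k)        # TP / (TP + FP)
        recall.append(tp / len(gold))   # TP / (TP + FN)
    return precision, recall

def aupr(precision, recall):
    """Area under the Precision-Recall curve (trapezoidal rule over recall)."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for p, r in zip(precision, recall):
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = p and r, p     # placeholder; see note below
    return area
```

Note: the update line above intentionally tracks the previous (recall, precision) pair; written plainly it is `prev_r, prev_p = r, p`.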
Results on DREAM4
Table 1 Results on the DREAM4 dataset. BC-X corresponds to the BRANE Cut method initialized with the weights of method X.

(a) Area Under Precision-Recall curve for the CLR, ND-CLR, GENIE3, ND-GENIE3 and BRANE Cut methods. For each given network, the two maximal values are reported in bold.

Network index    1      2      3      4      5      Average
CLR            0.256  0.275  0.314  0.313  0.318   0.295
BC-CLR         0.282  0.308  0.343  0.344  0.356   0.327
GENIE3         0.269  0.288  0.331  0.323  0.329   0.308
BC-GENIE3      0.298  0.316  0.357  0.344  0.352   0.333
ND-CLR         0.254  0.250  0.324  0.318  0.331   0.295
BC-ND-CLR      0.271  0.277  0.334  0.335  0.343   0.312
ND-GENIE3      0.263  0.275  0.336  0.328  0.354   0.309
BC-ND-GENIE3   0.275  0.312  0.367  0.346  0.368   0.334

(b) Relative gain obtained using BRANE Cut on different initial weights (CLR, ND-CLR, GENIE3, ND-GENIE3) on the DREAM4 dataset.

BC-CLR vs CLR              10.1 %  11.8 %   9.1 %   9.9 %  11.9 %  10.6 %
BC-GENIE3 vs GENIE3        10.7 %   9.9 %   7.8 %   6.5 %   7.0 %   8.4 %
BC-ND-CLR vs ND-CLR         6.6 %  10.7 %   3.0 %   5.5 %   3.7 %   5.9 %
BC-ND-GENIE3 vs ND-GENIE3   4.4 %  13.4 %   9.2 %   5.4 %   3.8 %   7.2 %

(c) Post-processing method comparison on the DREAM4 dataset. Relative gains are given for BRANE Cut using CLR (resp. GENIE3) weights compared to ND-CLR (resp. ND-GENIE3).

BC-CLR vs ND-CLR           11 %    23.2 %   5.9 %   8.2 %   7.5 %  11.2 %
BC-GENIE3 vs ND-GENIE3     13.8 %  14.9 %   6.2 %   4.9 %  −0.6 %   7.7 %
The AUPR values reported in Table 1(a) highlight (in bold) that the first and second best performances are always produced by BRANE Cut. Furthermore, each tested method (CLR, GENIE3, ND-CLR or ND-GENIE3) used as initialization exhibits an improved AUPR with BRANE Cut post-processing: the average improvement reaches 10.6 % based on the CLR weights, 8.4 % for the GENIE3 weights, 5.9 % with ND-CLR weights and 7.2 % with the ND-GENIE3 weights, see Table 1(b).
We finally compare ND and BRANE Cut as post-processing methods on original weights. As shown in Table 1(c), BRANE Cut outperforms network deconvolution, except for a practically unnoticeable degradation on the fifth network for GENIE3 weights. This relative improvement is essentially due to the fact that network deconvolution degrades results on the first two networks.
Based on the AUPR criterion, we conclude that BRANE Cut outperforms state-of-the-art methods. Specifically, single-threshold results are noticeably refined by our approach, regardless of the initial weights.
Results on the E. coli dataset
Best performance is expected toward the upper right side of Precision-Recall curves, with both high recall and precision. However, GRN Precision-Recall curves traditionally exhibit low Precision values over the whole curve on real datasets, due to the difficulty in inferring accurate regulation relationships among large numbers of genes. For instance, with the E. coli dataset, we observe that a recall below 0.05 corresponds to small inferred graphs, with fewer than 300 edges and a high precision (more than 75 %). Due to their higher predictive power and their readability, such small networks are often preferred by biologists. Hence, we focus on the upper-left part of the Precision-Recall curves in Fig. 5, emphasized in a close-up, corresponding only to high precision and small graphs. Here, BRANE Cut initialized with GENIE3 weights proves to be the best performer on smaller graphs (fewer than 100 edges, corresponding to a recall below 0.02). However, graphs of larger size (up to a recall of 0.08) are more accurately reconstructed with CLR and BRANE Cut initialized with CLR weights. Again, the BRANE Cut approach improves the prediction results of both CLR and GENIE3.
Area Under Precision-Recall curve, computation times and relative gains on the E. coli dataset using BRANE Cut with CLR or GENIE3 weights

                          CLR    BC-CLR  GENIE3  BC-GENIE3
AUPR (×10^{−2})           7.86   8.79    8.90    9.17
Total comput. time (min)  41.0   41.05   303     303.05

Gain: BC-CLR achieves an 11.8 % AUPR gain over CLR while being 7.4× faster than GENIE3; BC-GENIE3 achieves a 3.0 % AUPR gain over GENIE3 with a negligible additional computation cost.
Comparison of graph inference in terms of recall and number of true positive edges at constant precision, using GENIE3 or BRANE Cut-GENIE3 on the E. coli dataset

Precision (%)   Recall (%)
                GENIE3  BRANE Cut
90              0.55    2.00
85              2.31    3.40
80              3.77    3.86
75              4.19    4.55

Precision (%)   TP edges
                GENIE3  BRANE Cut
90              18      66
85              75      112
80              124     127
75              138     150
Inferred network example on E. coli
Influence of the proposed structural a priori
We start by analyzing the influence of our first a priori on the E. coli dataset using CLR weights. Using the first two terms with \({\lambda _{\text {TF}}} \neq {\lambda _{\overline {\text {TF}}}}\) leads to an AUPR of 0.0870, which constitutes a relative improvement of 10.7 % over CLR, without the coregulation a priori. More generally, as λ_{TF} and \({\lambda _{\overline {\text {TF}}}}\) are interpreted as a pair of thresholds, the higher these parameters, the greater the stringency in the inferred graph. These results show that allowing a different threshold value in the neighborhood of transcription factors than for other genes does play a positive role by itself. The regulator coupling term controlled by μ brings further improvements. Indeed, the addition of the third term results in an AUPR equal to 0.0879, corresponding to a relative improvement of 11.8 % over CLR. The corresponding Precision-Recall curves are displayed in the Additional file 1. They show that even if the gain brought by the coregulation a priori is smaller than the improvement allowed by the first a priori, it remains valuable despite being localized in the high-precision region.
Algorithmic and computational complexity
As previously mentioned in the ‘Optimization strategy’ section, we used the C++ code implemented by [26]. Using this algorithm, the computational complexity of BRANE Cut is O(mn^{2}C), where m (respectively n) is the number of edges (respectively the number of nodes) in the flow network \(\mathcal {G}_{f}\), and C the cost of the minimum cut. Specifically, in our case (without the dimension reduction trick presented in the ‘Problem dimension reduction’ section), the number of nodes n in the flow network is equal to the number of edges ε in the initial network, plus the number of genes G, plus two additional nodes (source and sink). The number of edges m is equal to \(\frac {3}{2}G^{2} + q\), where q is the number of edges coding for the coregulation a priori. Note that, as mentioned in [26], this complexity is not the best achievable by a max-flow algorithm. Meanwhile, their experiments showed better performance on several typical computer vision problems. Not being in a computer vision setting, we could benefit from faster max-flow algorithms. However, since the time spent on max-flow computation to infer the large Escherichia coli graph is small (only a few seconds), the benefit would not be noticeable.
Given precomputed weights, our algorithm requires 30 additional seconds to infer the E. coli network, without using the simplification described in the ‘Problem dimension reduction’ section. By computing the explicit solution to our problem on a subset of edges, we improve BRANE Cut computation times by a factor of 10. Given CLR weights computed in 41 minutes on an Intel Core i7, 2.70 GHz laptop, our algorithm thus only requires three additional seconds. We note that the weight computation time of GENIE3 is considerably longer (5 h), using the list of transcription factors. If one wished to build an E. coli network that would also contain \(\overline{\text{TF}}\)–\(\overline{\text{TF}}\) interactions using GENIE3, it would take 20 minutes per gene, for a total of about two months by a simple rule of three.
Results on DREAM5
We have evaluated BRANE Cut on the three DREAM5 networks (1, 3 and 4) for which a validation exists. BRANE Cut parameters are initialized with the proposed heuristics and results are obtained using the validation procedure previously detailed. BRANE Cut outperforms CLR and GENIE3 by 7 % and 5 %, respectively, on Network 1. The improvement is 2.8 % and 2.1 % for Network 3. For the fourth network, the maximum Precision only reaches about 35 %. The AUPR computed with every method is exceptionally low; as such, the relative AUPR differences are insignificant, within the numerical precision. The detailed AUPR are given in the Additional file 1. Regarding the results on these additional datasets, the proposed heuristics lead to improvements over the state of the art.
Conclusions
By using structural a priori that are often available but rarely used, we managed to infer networks that recover more true interactions than previous approaches, on both synthetic and real datasets. We have expressed the graph inference as an optimization problem, and used the generic Graph cuts approach, very popular in computer vision, to compute the optimal edge labeling of our inferred graph. Comparisons are performed with simple regularization parameters based on gene set cardinality. We obtain better results than both CLR and GENIE3 in terms of Area Under the Precision-Recall curve, even with ND-deconvolved networks. BRANE Cut yields state-of-the-art results, with a negligible computation time. While the GENIE3 method needs about five hours to obtain a 4345-gene network, limited to interactions involving transcription factors, we obtain a network with similar accuracy with our method in a few seconds, only using CLR weights computed in about forty minutes. This graph inference acceleration is thus useful to explore large datasets. Some predictions specifically identified by our method indeed appear as relevant interactions.
As mentioned in [2], communitybased methods represent a promising future for gene network inference, by aggregating the predictions of existing GRNs approaches. As our method takes any weights as an input, it has the potential to improve other GRN approaches providing pairwise weights. Results provided in the Additional file 1 illustrate these remarks.
Based on these assessments, a perspective consists of the integration of various weights, provided by competing GRN methods, to further improve and strengthen present results. This integration may involve multivalued graphs or network fusion [31].
Notes
Acknowledgements
We thank V. A. Huynh-Thu for her help in our comparisons with GENIE3, and Bruno Léty for careful proofreading. We are indebted to the reviewers for their insightful remarks, which contributed to significantly improve the quality of our paper.
Supplementary material
References
 1. Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert JP. Classification of microarray data using gene networks. BMC Bioinformatics. 2007;8(1):35.
 2. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796–804.
 3. Thomas SA, Jin Y. Reconstructing biological gene regulatory networks: where optimization meets big data. Evol Intell. 2013;7(1):29–47.
 4. Zhang X, Liu K, Liu ZP, Duval B, Richer JM, Zhao XM, et al. NARROMI: a noise and redundancy reduction technique improves accuracy of gene regulatory network inference. Bioinformatics. 2013;29(1):106–13.
 5. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Pac Symp Biocomputing. vol. 5. Hawaii, HI, USA. p. 415–29.
 6. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(Suppl 1):S7.
 7. Meyer PE, Kontos K, Lafitte F, Bontempi G. Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinformatics Syst Biol. 2007;2007:1–9.
 8. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5(1):54–66.
 9. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Sinica. 2005;4(1).
10. Charbonnier C, Chiquet J, Ambroise C. Weighted-Lasso for structured network inference from time course data. Stat Appl Genet Mol Biol. 2010;9(1).
11. Krouk G, Mirowski P, LeCun Y, Shasha DE, Coruzzi GM. Predictive network modeling of the high-resolution dynamic plant transcriptome in response to nitrate. Genome Biol. 2010;11(12):123.
12. Feizi S, Marbach D, Médard M, Kellis M. Network deconvolution as a general method to distinguish direct dependencies in networks. Nat Biotechnol. 2013;31(8):726–33.
13. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010;5(9):1–10.
14. Huynh-Thu VA, Sanguinetti G. Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics. 2015;31(10):1614–22.
15. Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R. Gene regulatory network inference: data integration in dynamic models—a review. Biosystems. 2009;96(1):86–103.
16. Kolmogorov V, Rother C. Minimizing nonsubmodular functions with graph cuts—a review. IEEE Trans Pattern Anal Mach Intell. 2007;29(7):1274–9.
17. Parikh JR, Xia Y, Marto JA. Multiedge gene set networks reveal novel insights into global relationships between biological themes. PLoS One. 2012;7(9):1–15.
18. Sugiyama M, Azencott CA, Grimm D, Kawahara Y, Borgwardt KM. Multi-task feature selection on multiple networks via maximum flows. In: Proc SIAM Int Conf Data Mining. Philadelphia, PA, USA: 2014. p. 199–207.
19. Chiquet J, Smith A, Grasseau G, Matias C, Ambroise C. SIMoNe: Statistical Inference for MOdular NEtworks. Bioinformatics. 2009;25(3):417–8.
20. Espinosa-Soto C, Wagner A. Specialization can drive the evolution of modularity. PLoS Comput Biol. 2010;6(3):e1000719.
21. Canny J. A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell. 1986;8(6):679–98.
22. Ollion J, Cochennec J, Loll F, Escudé C, Boudier T. TANGO: a generic tool for high-throughput 3D image analysis for studying nuclear organization. Bioinformatics. 2013;29(14):1840–1.
23. Huber P. Robust Statistical Procedures, 2nd edn. Philadelphia: Society for Industrial and Applied Mathematics; 1996.
24. Ford LR Jr, Fulkerson DR. Maximal flow through a network. Canad J Math. 1956;8:399–404.
25. Kolmogorov V, Zabih R. What energy functions can be minimized via graph cuts? IEEE Trans Pattern Anal Mach Intell. 2004;26(2):147–59.
26. Boykov Y, Kolmogorov V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell. 2004;26(9):1124–37.
27. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci USA. 2010;107(14):6286–91.
28. Salgado H, Gama-Castro S, Peralta-Gil M, Díaz-Peredo E, Sánchez-Solano F, Santos-Zavaleta A, et al. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res. 2006;34(Database issue):394–7.
29. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013;41(D1):808–15.
30. Keseler IM, Mackie A, Peralta-Gil M, Santos-Zavaleta A, Gama-Castro S, Bonavides-Martínez C, et al. EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res. 2013;41(D1):605–12.
31. Abu-Jamous B, Fa R, Roberts DJ, Nandi AK. Paradigm of tunable clustering using binarization of consensus partition matrices (BiCoPaM) for gene discovery. PLoS One. 2013;8(2).
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.