Background

Deciphering the structure of the gene regulatory networks (GRNs) [1] is crucial for bioinformatics, as it provide insight on the development, functioning and pathology of biological organisms. With the advent of high-throughput technologies such as next-generation sequencing, it has become relatively easy to measure chromatin state and gene expression genome-wide. Gene expression data obtained from high-throughput technologies correspond to the expression profiles of thousands of genes, which reflect gene expression levels for different replicates or experimental conditions (e.g., physicochemical, temporal and culture medium conditions). As a consequence, many methods have been proposed to solve the GRN reverse engineering problem by using gene expression data [25].

However, inferring the GRN from gene expression data remains a daunting task due to the large number of potential interactions, the small number of available measurements and the high dimensional, noisy data. Methods based on the statistical analysis of dependencies have been applied to the inference of GRNs, such as the method proposed in [6], which uses correlation coefficients to define the gene similarity metric for inferring the GRNs. One weakness of this method is that correlation coefficients fail to identify more complex statistical dependencies (e.g., non-linear ones) between genes. Thus, information theoretic measures have been proposed to capture more complex dependencies. In particular, these methods use mutual information (MI) between a pair of genes as a measure to infer networks [7]. As the existence of indirect interactions in relevance network, some refinements have been proposed to correct the predictions. For example, the CLR method [8] eliminates indirect influences based on the empirical distribution of all mutual information scores. The ARACNE method [9] was also designed to filter out indirect interactions by using the Data Processing Inequality. C3NET [10] and its extension BC3NET [11] correct the predictions based on estimates of mutual information values in conjunction with a maximization step. The ANOVerence method [12] includes meta-information of the microarray chips to guide the network inference process and uses η 2 score as an alternative measure to evaluate dependencies between genes, where η 2 score is a correlation coefficient derived using ANOVA.

The methods [1315] based on probabilistic graphical models (e.g., Bayesian networks) have been widely used to infer GRNs. However, Bayesian networks do not allow the presence of feedback loops. Dynamic Bayesian networks [16, 17] are able to overcome the limitation while they can only handle time-series expression data. Moreover, learning the structure of a Bayesian network is a daunting task both from a computational and theoretical point of view [18]. Comparisons of existing GRN inference methods and detailed reviews can be referred in [4, 19].

Recently, some ensemble methods [18, 2022] formalized the GRN inference problem as a feature selection problem and show interesting performance. The GENIE3 method [18], which is based on feature selection with ensembles of random forests, is recognized as state-of-the-art on some benchmarks [19]. As using random forests for feature selection is not well understood theoretically, the TIGRESS method [20] uses least angle regression (LARS) with stability selection combined to solve the GRN inference problem. The ENNET method [21] aggregates the features selected by an algorithm based on Gradient Boosting Machine. However, the ENNET method has high computational cost when it is applied on the high-dimensional data (i.e., the data with thousands of features). The NIMEFI method [22] explores the potential of several ensemble methods, such as GENIE3, Ensemble Support Vector Regression (E-SVR) and Ensemble Elastic Net [23] (E-EL), and combines the predictions of these methods under a general framework. However, NIMEFI has more adjustable parameters than other ensemble GRN inference methods, which increases the uncertainties of the model.

In this paper, we propose a new ensemble GRN inference method based on partial least squares (PLS). The method casts PLS-based feature selection algorithm into an ensemble setting by taking random potential regulatory genes. Then, we use a statistical technique to refine the predictions in our method by taking into account the impact of hub regulatory gene (i.e., a regulatory gene regulates many target genes). Various evaluations of techniques have been performed in the context of DREAM (Dialogue for Reverse Engineering Assessments and Methods) challenges [24], which aims to provide researchers with benchmark datasets to validate their work. Hence, we compare the performance of our method to several state-of-the-art methods in DREAM4 [25, 26] and DREAM5 [27] gene reconstruction challenge, and the results show our method performs competitively.

Methods

Problem definition

We focus on inferring the directed topology of GRNs using gene expression data in this paper. As input data, we consider gene expression measurements for p genes in n experimental conditions. The same as many ensemble methods (e.g., GENIE3, TIGRESS, ENNET and NIMEFI), we use a general framework for GRNs inference problem, which does not take the information of different experimental conditions (e.g., gene-knockouts, perturbations and even replicates) into account. The gene expression data D is defined as following:

$$ D\kern0.5em =\kern0.5em \left[{x}_1,\dots, {x}_p\right]\in {R}^{n\times p} $$
(1)

where x i is a column vector of expression values of i-th gene in n experimental conditions.

GRN inference methods aim to make a prediction of the regulatory links between genes from gene expression data D. Most methods provide a ranking list of the potential regulatory links from the most to the less confident. Then, a network is automatically obtained by selecting a threshold value on this ranking. As it is beneficial to the end-user to explore the network at all sorts of threshold levels [22], we focus only the ranking task in this paper. It should be noted that the ranking is the standard prediction format of the DREAM challenges, where the challenges have been widely used to evaluate various GRN inference methods.

In order to infer the regulatory network from the expression data D, we compute a score w ij for a potential edge directed from gene i to gene j, where the edge indicates that gene i regulates gene j on expression level and the score w ij represents the strength that gene i associates (i.e., regulates) gene j.

Network inference with feature selection methods

Motivated by the success of ensemble methods based on feature selection (e.g., GENIE3 and TIGRESS), we decomposed the GRN inference problem with p genes into p subproblems, where each of these subproblems can be viewed as a problem of feature selection in statistics [18, 28]. More specifically, for each target gene, we wish to determine the subset of genes which directly influence the target gene from the expression level. Let D is the gene expression data defined in (1), the i-th gene is the target gene, and we define candidate regulators containing expression values in n experimental conditions as:

$$ {x}^{-i}\kern0.5em =\kern0.5em \left[{x}_1,\dots, {x}_{i-1},{x}_{i+1},\dots, {x}_p\right] $$
(2)

and the feature selection problem can be defined as:

$$ {x}_i\kern0.5em =\kern0.5em f\left({x}^{-i}\right)\kern0.5em +\kern0.5em \varepsilon,, \forall i\in \left\{1,2,\dots, p\right\} $$
(3)

where f is a regression function exploits the expression in x − i of genes that are directly connected to gene i, and ε is some noise. Usually, f can be defined as:

$$ f\left({x}^{-i}\right)\kern0.5em =\kern0.5em {\varSigma}_j{w}_{ji}{x}_j,\ \forall j\in \left\{1,\dots, i-1,i\kern0.5em +\kern0.5em 1,\dots, p\right\} $$
(4)

where w ji  ≥ 0 represents the strength that gene i associates (i.e., regulates) gene j. The rankings of the regulatory links of gene i is obtained by computing the w ji . By aggregating the p individual gene rankings, we can get a global ranking of all regulatory links.

GRN inference with PLS-based ensemble methods

Recently, as PLS (Partial Least Squares) has been exploited by several authors to address the problem of feature selection for classification and showed interesting performance, such as TotalPLS [29] and KernelPLS [30], in this paper, we also use the PLS based method to solve the problem defined in (3). One difficulties of GRN inference problem is that we do not know how many candidate regulatory genes are sufficient to provide a good model for a target gene. For the purpose, we use PLS-based ensemble method. The basic idea is that the w ji for gene i is computed by running PLS-based feature selection method many times, resampling the samples and selecting random K candidate regulatory genes at each run. We discuss and explore the effect of K values on the method performance in the Results Section.

Feature selection with PLS-based method

Let X = [x 1, …, x p ] ∈ R n × p be a matrix that has been normalized to have a mean of zero and Y = [y 1, …, y n ]T ∈ R n × 1 be a column vector that has been normalized to have a mean of zero. PLS aims to find a pair of projection directions w and u such that the projections P = Xv (i.e., PLS components) and Q = Yu can carry as much information on variation as possible in X and Y [31]. The projections P and Q can be obtained by solving the criterion function as:

$$ \left\{\begin{array}{c}\hfill max\kern0.5em J\left(v,\kern0.5em u\right)\kern0.5em =\kern0.5em \frac{{\left({v}^T{\Sigma}_{\mathrm{XY}}u\right)}^2}{v^Tv\cdot {u}^Tu}\hfill \\ {}\hfill s.t.\kern0.5em {v}^Tv={u}^Tu=1.\hfill \end{array}\right. $$
(5)

where Σ XY  = cov 2(X, Y) is the covariance matrix for the vectors of X and Y.

The common solutions to PLS-based model include Non-linear Iterative PLS (NIPALS) [32] and Statistically Inspired Modification of PLS (SIMPLS) [33]. As SIMPLS is slightly superior to NIPALS and is computationally efficient, our analysis and calculation is based on SIMPLS in this paper. PLS components P are constructed to maximize the objective function based on the sample covariance between Y and P = Xv. Let m be the number of components, SIMPLS is able to calculate v 1, v 2, …, v m by solving the objective function as follow:

$$ \left\{\begin{array}{c}\hfill maxJ(v)\kern0.5em =\kern0.5em co{v}^2\left(X{v}_i,\kern0.5em Y\right)\hfill \\ {}\hfill s.t. \parallel {v}_i\parallel \kern0.5em =\kern0.5em 1;\hfill \\ {}\hfill {v}_i^T\left({X}^TX\right){v}_j\kern0.5em =\kern0.5em 0;\hfill \\ {}\hfill j\kern0.5em =\kern0.5em 1,\dots, i\kern0.5em -\kern0.5em 1.\hfill \end{array}\right. $$
(6)

The component P i  = Xv i , which extracts from the SIMPLS calculation, represents as much variation information of X as possible. To explain the information of Y, the component should be associated with Y as much as possible. In order to analyze the explanation of variation of X to Y, the variable importance in projection (VIP) [34] is introduced to quantitatively denote the impact of i-th feature to Y.

Definition VIP: Let r(⋅,⋅) be the correlation coefficient between two variables. The VIP is defined as:

$$ VIP\left({x}_j\right)\kern0.5em =\kern0.5em \sqrt{p\frac{{\displaystyle {\sum}_{i=1}^m}\psi \left(Y;{t}_i\right){v}_{ji}^2}{{\displaystyle {\sum}_{i=1}^m}\psi \left(Y;{t}_i\right)}} $$
(7)

where ψ(Y; t i ) = r 2(Y, P i ) is the explanation of variation of component P i to Y, p is the number of features and v ji is the weight of the j-th feature for the i-th component. The larger value of VIP(x j ) is, the more explanatory power of x j to Y.

The pseudo code of PLS-based feature selection is presented in Method 1.

figure a

Refining the inferred regulatory network

In our method, we use a statistical technique to refine the inferred regulatory network in our method. The final result is improved under the assumption that if a regulatory gene regulates many target genes (e.g., the regulatory gene is hub node), it is an important regulatory gene. Once the solution of the gene regulatory network inference is calculated, we can obtain an adjacency matrix W, where W ij represents the strength that gene i associates (i.e., regulates) gene j. Regulatory genes are scored based on their impacts on multiple target genes. An updated adjacency matrix W is given as:

$$ W\left(i,:\right)\kern0.5em =\kern0.5em W\left(i,:\right)*{\sigma}_i^2,\forall i\in \left\{1,2,\dots, p\right\} $$
(8)

where W(i, :) is the i-th row of W, and σ 2 i is a variance in the i-th row of W. It should be noted that each row of W is calculated in a subproblem of our method. Each row of W contains relative scores with respect to a different target gene. Therefore, if a regulatory gene regulates many target genes, the variance in a row of W corresponding to that regulatory gene is elevated.

The pseudo code of PLS-based ensemble method (PLSNET) is presented in Method 2.

figure b

Parameter settings

The main parameters of PLSNET are the number of components m and the number of candidate regulatory genes K. Parameter selection (i.e., the selection of the m variable) for the PLS model is a difficult task due to the fact that if m is too large, there will be over-fitting in the model and if m is too small, there will be under-fitting in the model. There are two widely used methods for PLS parameter tuning, specification and cross validation (CV). The drawback to CV is that it significantly increases the computation cost and the problem to a certain extent becomes even more difficult to handle. The specification method usually fixes the value of m, typically, the value is not larger than 5. Since we do not know how many candidate regulatory genes are sufficient to provide a good model for a target gene, the choice of K may not be trivial.

In this paper, we evaluated our method PLSNET on two popular benchmarks: DREAM4 multifactorial datasets and DREAM5 datasets. For DREAM4 multifactorial datasets, we use CV to set two main parameters of PLSNET, where m is chosen from {1, 2, ⋯, 5} and K is chosen from {5, 10, ⋯, 100}. And we choose the parameter setting (m = 5, K = 30) as default values. As the size of DREAM5 datasets is much larger than that of DREAM4 multifactorial datasets, it is difficult to utilize CV to choose the parameters due to the fact that it would significantly increase the computation cost. Instead, we utilize the specification method to set m = 5. And following the suggestion of GENIE3 [18], we set \( K=\sqrt{p} \) as default value for DREAM5 datasets.

Computational complexity

As shown in Method 2, there are two main parts in PLSNET, including calculating the score of each edge and refining the inferred network. Consider N × P matrix X and N × 1 matrix Y, SIMPLS is O(mNP) complex. Here, m is the number of components, N is number of samples and P is the number of genes. Another part of PLSFS (i.e., VIP) is also O(mNP) complex. Hence, the computational complexity of PLSFS is O(mNP) and we calculate the score of each edge in an O(mTKNP) time, where K is the number of candidate regulatory genes and T is the number of iterations. PLSNET’s complex is thus on the order of O(mTKNP + P 2). In practice, the dominating part of the sum is mTKNP and the value of m is not larger than 5, we therefore report a final computational complexity of PLSNET as O(TKNP). We compare our method with other inference methods in Table 1. It should be noted that the calculation of the mutual information matrix is not included for information-theoretic methods (i.e., CLR and ARACNE).

Table 1 The computational complexity of different GRN inference methods

Results

In recent years, the problem of evaluating performance of the inference methods on adequate benchmarks has been widely investigated [24, 35]. The most popular benchmarks, such as S. cerevisiae [36], E. coli [37] and artificially simulated in silico networks [24, 3840], are derived from well-studied in vivo networks of model organisms. One weakness of in vivo benchmark networks is that no matter how well the model organism is studied, experimentally confirmed pathways can never be assumed complete [21]. As such networks are assembled from known transcriptional interactions with strong experimental support, the gold standard networks are expected to have few false positives. Given a gene expression data matrix, a GRN inference method outputs a ranked list of putative regulatory interactions. Taking the top L predictions in this list, we can compare them to known regulations (i.e., the gold standard networks) to evaluate the performance of the GRN inference method.

In this paper, we used several popular benchmark GRNs to evaluate the accuracy of our proposed method and compare it with the other inference methods. The datasets we used in our experiments are from DREAM challenges and the details of the datasets are summarized in Table 2. The first three networks come from the DREAM5 challenge. Network 1 (in-silico) is a simulated network with simulated expression data, while the other two expression datasets are real expression data collected for E. coli (Network 3) and S. cerevisiae (Network 4). It should be noted that we do not use Network 2 of DREAM5 in our experiments for the reason that there is no verified interaction provided for this dataset. In order to assess the ability of our method to predict directionality, we used the five DREAM4 size 100 Multifactorial Networks in our experiments, where the regulatory genes are not known in advance for these networks.

Table 2 Datasets

In fact, DREAM4 and DREAM5 datasets have been widely used for several GRNs inference methods to evaluate the performance recently. For example, the authors of TIGRESS [20] compared the performance of some GRNs inference methods on DREAM4 Multifactorial Networks and DREAM 5 Networks in 2012. In the same year, the authors of ANOVerence [12] presented the results of several GRNs inference methods performed on DREAM5 Networks. In 2014, the performance comparisons of many GRNs inference methods on DREAM4 Multifactorial Networks and DREAM 5 Networks were shown in NIMEFI [22].

We evaluated the accuracy of the methods using the Overall Score metric proposed by the authors of DREAM challenges [24], as shown in the following:

$$ Overall\ Score\kern0.5em =\kern0.5em -\kern0.5em \frac{1}{2}lo{g}_{10}\left(P\_ AUPR\cdot P\_ AUROC\right) $$
(9)

where P_AUPR and P_AUROC are respectively the geometric means of p-values taken over the networks from DREAM challenges, relating to an area under the precision-recall curve (AUPR) and an area under the receiver operating characteristic curve (AUROC). The probability densities of DREAM Network data which are used to calculate the p-values and the respective gold standard networks are provided on DREAM web site.

Performance evaluation

We compare the performance of our method PLSNET with five of the most prominent GRN inference methods, GENIE3 [18], TIGRESS [20], CLR [8], ARACNE [9] and NIMEFI [22], that are widely used in the literature. Moreover, the top three performers in each of DREAM challenges as listed on the DREAM web site are also selected for comparison. We use the Matlab implementations of GENIE3 and TIGRESS, while ARACNE and CLR are run in the minet R package [41]. NIMEFI is implemented using the R package available for download at http://bioinformatics.intec.ugent.be/. The Matlab code of PLSNET is included in Additional file 1. We keep default parameter values for each of these methods and set the number of iterations T = 1000 for ensemble methods (i.e., GENIE3, TIGRESS, NIMEFI and PLSNET).

Performance on the DREAM4 multifactorial datasets

The goal of the In Silico Size 100 Multifactorial challenge of DREAM4 was to infer five networks from Multifactorial perturbation data, where each of them contained 100 genes and 100 samples. Multifactorial perturbation data are defined as gene expression profiles resulting from slight perturbations of all genes simultaneously. The topology of these benchmark networks were derived from the transcriptional regulatory system of S. cerevisiae and E. coli.

Each DREAM4 Multifactorial Network data is a 100 × 100 matrix, where each column represents a gene and each row represents a different experimental condition (i.e., perturbation). The values in the matrix are the expression values of the genes on the respectively experimental conditions. In our experiments, all compared GRNs inference methods used these matrices as the input data and the results are shown in Table 3.

Table 3 Performance comparisons of different GRN inference methods on the DREAM4 networks, challenge size 100 Multifactorial

Table 3 lists the results of PLSNET with default parameter setting (m = 5, K = 30) compared with those of other GRN inference methods on the DREAM4 multifactorial datasets. Without further optimization of the parameters on these networks, PLSNET achieves the best Overall Score. And PLSNET shows particularly strong performance on Network2 and Network4, improving over other GRN inference methods in terms of AUPR and AUROC.

Influence of parameters

In this section, we provide more details about the influence of the parameters of compared methods on performance, taking five DREAM4 Multifactorial Networks as benchmark datasets.

Figure 1 summarizes the Overall Score of three compared methods (PLSNET, GENIE3 and NIMEFI) for different number of candidate regulatory genes K on the DREAM4 multifactorial datasets. As seen in Fig. 1, the range of K values leading to the best performance is narrow with our proposed method, and therefore it is difficult to find an appropriate value of K as default value in advance.

Fig. 1
figure 1

Boxplots of Overall Score on DREAM4 Multifactorial Networks with respect to the number of candidate regulatory genes

Figure 2 shows the Overall Score of our method for different number of components m with K = 30 on the DREAM4 multifactorial datasets. We observe in Fig. 2 that the Overall Score is not very sensitive to the choice of the number of components, and therefore one may practically more easily tune it for optimal performance.

Fig. 2
figure 2

Boxplots of Overall Score on DREAM4 Multifactorial Networks with respect to the parameter of PLSNET

Figure 3 shows the influence of two main parameters (α and L) of TIGRESS on the Overall Score using DREAM4 Multifactorial datasets, where α ∈ [0, 1] controls the random re-weighting in each stability selection run and L is the number of LARS (Least Angle Regression) steps. The Overall Score of ARACNE for different kernel widths on DREAM4 Multifactorial Networks is shown in Fig. 4.

Fig. 3
figure 3

Boxplots of Overall Score on DREAM4 Multifactorial Networks with respect to the parameters of TIGRESS

Fig. 4
figure 4

Boxplots of Overall Score on DREAM4 Multifactorial Networks with respect to the parameter of ARACNE

Performance on the DREAM5 datasets

The three DREAM5 datasets were structured with respect to different model organisms, and were different in size. The expression data of the only one network (Network1) were simulated in silico, while two other sets of expression data were measured in real experiments in vivo. As in all DREAM challenges, in silico expression data were simulated using an open-source GeneNetWeaver simulator [25]. The gold standard networks of DREAM5 were mainly obtained from two sources: Gene Ontology (GO) annotations [42] and RegulonDB database [36].

Each DREAM5 Network data contain three files: network chip features, network transcription factors and network expression data. The file of network chip features records the details of each experimental condition in network expression data, which contain time series, perturbations and even replicates. However, as mentioned in Section 2.1, we do not use the information for inferring GRNs. And the methods compared in our experiments do not use the information as well. The file of network transcription factors records the genes that have been verified to be regulatory genes. Typically, the number of regulatory genes is used as a parameter for GRNs inference methods to construct the model. The file of network expression data contains a n × p matrix, where n represents the number of experimental conditions and p is the number of genes, and the values in the matrix are the expression values of the genes on the respectively experimental conditions. In our experiments, all compared methods used these matrices as the input data and the results are given in Table 4.

Table 4 Performance comparisons of different GRN inference methods on the DREAM5 networks

Table 4 summarizes the results of PLSNET with default parameter setting (\( m=5,\ K=\sqrt{p} \)) compared with those of other GRN inference methods on the DREAM5 datasets. As seen in Table 4, PLSNET achieves the best Overall Score, as well as the best individual AUROC scores for Network 1 and Network 4. ANOVerence achieved the best performance on the E. coli network (Network 2), as it does include meta-information of the microarray chips to guide the network inference process.

Since the number of regulatory genes on DREAM5 datasets is much larger than that of on DREAM4 datasets, it is more difficult to set the number of candidate regulatory genes K. In our experiments, we set \( K=\sqrt{p} \) and observed that our method perform well in this setting. However, it should be noted that better results could be obtained if K is set to other values.

Obviously, all GRN inference methods achieved better scores for an in silico network (Network 1) than for other two in vivo networks. One main reason for a poor performance of the inference methods for in vivo networks may be that experimentally confirmed pathways, and the gold standards derived from them, cannot be assumed completely. On the other hand, in silico datasets provide enough information to confidently reverse-engineer their underlying structure.

CPU time

In our experiments, ARACNE, CLR and NIMEFI were implemented using the R package, while GENIE3, TIGRESS and our method PLSNET were run in Matlab. As PLSNET is an ensemble method, we focus on the running time of ensemble methods rather than other GRN inference methods. On the other hand, ensemble methods usually achieve better results than other GRN inference methods.

Table 5 gives an overview of the running times of some of the GRN inference methods. These measurements were conducted using Matlab (R2010a edition), an Intel Core (TM) i5-3317U, clocked at 1.70 GHz, 4.00 GB of RAM memory and a 64-bit Microsoft Windows 7 operating system. Note that we do not include NIMEFI for comparison due to the fact that NIMEFI is a method using multiple ensembles of GRN inference methods, including GENIE3, Ensemble Elastic Net and Ensemble Support Vector Regression.

Table 5 Comparisons of running times of different GRN inference methods

As can be seen from the table, in terms of computational efficiency, PLSNET performs best on DREAM5 networks and performs the second best on DREAM4 networks. GENIE3 performs best on DREAM4 networks as the size of the datasets is small. However, GENIE3 is more time consuming than PLSNET when it is implemented on the big datasets.

Conclusions

In this paper, we presented PLSNET, a new ensemble method for GRN inference. PLSNET expresses the GRN inference problem as a feature selection problem, and solves it with the PLS-based feature selection method combined with a statistical technique for refining the predictions. The influence of PLSNET parameters was clarified in this paper, and we showed that further improvement may result from finer parameter tuning.

Different from other ensemble methods, such as GENIE3 and TIGRESS, PLSNET aggregates the features selected by PLS-based method. Moreover, considering that if a regulatory gene regulates many target genes (e.g., a regulatory gene is a hub node), it indicates an important regulator gene; we use a statistical technique to refine the inferred network in our method.

We evaluated our proposed method on the DREAM4 multifactorial and DREAM5 benchmarks and achieved higher accuracy than other state-of-the-art methods. Furthermore, among ensemble GRN inference methods, our method is computationally efficient.