Background

The regulation of gene expression is a critical mechanism in the control of biological processes in cellular organisms. At the transcriptional level, the main regulators contributing to the control are transcription factors (TFs), proteins that bind to cis-regulatory elements in the gene promoter regions [1]. By activating or repressing their target genes, TFs can regulate the global gene expression program of a living cell, and form transcriptional regulatory networks [2-4].

Recent studies have identified that microRNAs (miRNAs) play an important role in gene regulation at the post-transcriptional level. The regulation process takes place via mRNA cleavage or translational repression, with miRNAs binding to the 3’-untranslated regions (3’-UTRs) of target mRNAs through base pairing to complementary sequences [5-8]. It has also been demonstrated in a body of literature that miRNAs regulate a wide range of biological processes in proliferation [9, 10], metabolism [11, 12], differentiation [13], development [14, 15], apoptosis [12, 16, 17], cellular signaling [18] and even cancer development and progression [7, 19, 20].

It is necessary to draw a unified picture for the regulatory relationships between TFs, miRNAs and genes. However, a challenge is that the combined regulations of miRNAs and TFs are complicated, since they involve not only the interactions between each regulator and their target genes, but also the interactions between the regulators themselves. Studies of the gene regulatory networks with the presence of both TFs and miRNAs will help elucidate the regulatory mechanisms involving both direct and indirect regulatory relationships. However, It is still highly unlikely for these relationships to be discovered by biological experiments directly, as the process would be extremely costly and time consuming. On the other hand, well-designed computational approaches may facilitate the understanding of such complex relationships.

Previously, researchers studied the co-regulation of TFs and miRNAs by finding out their shared downstream targets [21, 22]. The methods used probabilistic models or statistical tests to measure the significance of the shared targets between the regulators, and to remove the insignificant co-regulating interactions that occurred by chance. Gene enrichment analysis was used in [23] to identify significant co-regulation between the transcriptional and post-transcriptional layers. They found that some biological processes emerged only in co-regulation and that the disruption of co-regulation may be closely related to cancers, suggesting the importance of the co-regulation of miRNAs and TFs. In [23] available predicted targets databases are used to construct the network, and then Gene Ontology (GO) was used to discover the significant functional co-regulation pairs. Tran et al. [24] proposed a rule based method to discover the gene regulatory modules that consist of miRNAs, TFs, and their target genes based on the available predicted target binding information. Le Béchec et al. [25] integrated available target prediction databases to construct a regulatory network that involves miRNAs, TFs, and mRNAs. This work provides a good resource for exploring the regulatory relationships or identifying the network motifs. However, target prediction based on sequences have high rate of false discoveries, which affect the quality of the discoveries of the above mentioned methods. It would be ideal if expression data can be used to refine the discoveries.

Roqueiro et al. [26] proposed a method to identify the key regulators (miRNAs or TFs) of pathways. The method used Bayesian inference on known pathway structures to infer a set of regulators in the pathway network. The Bayesian network in this method was constructed manually using the known KEGG pathways by removing the cycles in the pathways and applying some filtering criteria. The method drew findings based on existing knowledge and provided a good resource for other methods to validate their results. However, it may not be good for exploratory study.

Recently, Huang et al. [27] developed a web tool (mirConnX) for constructing the regulatory networks that include miRNA, TFs, and mRNAs. The built networks can be further analysed to identify network motifs. The method has used both predicted targets and expression data to build the network. The method integrated the association network based on expression data and the prior network based on sequence data. However, an edge in this network shows association, which may not indicate a regulation relationship. A strong association of A and B may be a result of a common regulator which regulates both A and B. Zacher et al. [28] proposed a Bayesian inference method based on expression data to explain the activity of miRNAs and TFs. However, this approach does not take into account the interactions between miRNAs and TFs.

In this paper, we present a framework to construct the complex regulatory network with three components: TFs, miRNAs, and target genes. Our approach aims to discover the regulatory relationships of miRNAs and TFs on their target genes respectively, as well as the interplay between the two different types of regulators. The method utilises multiple sources of data, including gene expression data, target information of each regulator based on sequence data, and sample categories (conditions). To test the proposed method, we use the expression data from the NCI-60 panel of cell lines [29], and investigate the interactions that may involve in the biological process of epithelial-to-mesenchymal transition (EMT).

Methods

Notation and definitions

Consider three expression data sets profiling K miRNAs, I TFs, and J mRNAs across S samples, respectively. Let x={x k },y={y i },z={z j } be the vectors of miRNAs, TFs, and mRNAs, respectively, where 1≤kK,1≤iI, and 1≤jJ. Each sample is labelled by its category, i.e. the biological condition of the samples, such as cancer or normal.

In this paper, our goal is to discover the interactions between x,y,z (x and y are regulators) supported by the expression data and under the constraint of target information (see Figure 1). Target information for a regulator is the interactions between the regulator and the regulated genes that are predicted based on the sequence data. We are particularly interested in the interactions between x and y (called the interplay of miRNAs and TFs), and network motifs, which are patterns of subgraphs that recur at frequencies much higher than those found in randomised networks [2].

Figure 1
figure 1

Method overview. The method utilises Bayesian network learning and graph search to produce a three-component regulatory network (miRNA-TF-mRNA) from multiple sources of data. Target information is used to create the initial structure representing the interactions between miRNA-mRNA, miRNA-TF, TF-miRNA, TF-TF, and TF-mRNA. For illustration, we only draw one bipartite graph in the initial structure. Expression profiles are then used in the Bayesian network learning procedure to construct the networks in each sample condition. Bootstrapping and averaging procedure is used to integrate the Bayesian networks learnt from each condition into the integrated global network. The interplays between miRNAs and TFs and the network motifs that involve at least 2 regulators are extracted from the global network and are final results.

Method overview

In the remaining parts of the Methods section, we present our framework for constructing the regulatory network with the co-existence of both regulators, TFs and miRNAs. The method aims to produce regulatory networks including miRNAs, TFs, and genes that are relevant to diseases. The overall process is shown in Figure 1.

There are three main steps in the framework: (1) data preparation, (2) network learning and integration, and (3) network inferences. In Step (1), we prepare the input for the network structure learning, including collecting target information for TFs and miRNAs, normalising expression data, and analysing differentially expressed genes. At the beginning of Step (2), the target information is transformed into the 5 types of network sub-structures (miRNA-mRNA, miRNA-TF, TF-miRNA, TF-TF, and TF-mRNA), which are used as the initial structure for the Bayesian network learning process (refer to Figure 1). Additionally the expression datasets are split according to sample conditions. The initial structure are evaluated based on the expression profiles in each condition. The Bayesian networks learnt under each condition are integrated using a bootstrapping and averaging procedure. Therefore the result of Step (2) is an integrated global network with three components: miRNAs, TFs, and mRNAs. In the network inference step (Step (3)), we search the global network for the subgraphs that show the interplay between miRNAs and TFs, and network motifs that involve at least two regulators.

In the following, we describe each of the three steps in detail.

Step (1): Data preparation

Epithelial-to-mesenchymal transition (EMT) is part of the process of tissue remodeling during embryonic development and wound healing [30], and during carcinogenesis [31] when cancer cells undergo a change transforming into a more invasive tumor [30, 32].

After EMT induction, cells lose their epithelial features characterised by the high E-cadherin expression level, and acquire mesenchymal characteristics, including Vimentin filaments and a flattened phenotype. By expressing proteases, cells become more invasive, and they can pass through the underlying basement membrane and migrate. These are crucial steps in the multistep process of metastasis [33].

Data used in this study contain miRNA expression profiles for the NCI-60 panel of 60 cancer cell lines obtained from Sø et al. [34] (available at [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26375]). The mRNA expression profiles for NCI-60 are downloaded from ArrayExpress [http://www.ebi.ac.uk/arrayexpress], accession number E-GEOD-5720. Cell lines categorised as epithelial (11 samples) and mesenchymal (36 samples) are used in this work.

In order to identify all the TF genes in the data sets, we use the list of TF repertoire mined from [1]. This list is then used to query against the mRNA expression profiles from NCI-60 to extract TF expression profiles.

After normalising the expression data of miRNAs, TFs, and mRNAs, differentially expressed gene analysis is conducted respectively to all the three components, TFs, miRNAs, and mRNAs. The differentially expressed genes between epithelial and mesenchymal samples are identified using the limma package of Bioconductor [35] with the Benjamini and Hochberg’s (BH) correction method [36]. 148 probes of TFs, and 2251 probes of mRNAs are identified as differentially expressed at significant levels (adjusted p-value <0.1). Also 43 probes of miRNAs are identified with adjusted p-value <0.01. The reason for choosing adjusted 0.01 as the cut-off for miRNA differentially expressed analysis is that the B statistic value output from limma changes the value significantly between adjusted p-value <0.01 and adjusted p-value >0.01. This is a good sign for using the value of 0.01 as a cut-off, and the number of miRNAs obtained with this cut-off is also reasonable for analysing the results. (The details of differentially expressed genes are in Additional file 1).

The data input to the Bayesian network learning in the next step is the expression profiles of three components, miRNAs, TFs, and mRNAs. To integrate the data profiled from different platforms, we discretise the expression values of each gene in each sample to binary values (standing for up-regulation and down-regulation) by using the median of each array as the cut-off.

Another input to the Bayesian network learning is the target information, which is used as the prior knowledge (initial Bayesian network structure) to reduce the search space in the learning. miRNA targets and TF targets are collected via commonly used databases. These databases usually predict target genes using sequence data. In this paper, we are particularly interested in the information of TFs targeting both mRNA and miRNA genes, and the miRNAs targeting mRNA and TF genes. We use TRANSFAC 9.3 [37] and the promoter database [38] to generate TF target information. TF target information for TF-mRNA and TF-TF pairs is obtained from CRSD [39], the database utilising and integrating six well-known large scale databases, including TRANSFAC 9.3 and promoter databases. To obtain the TF-miRNA target information, we use MIR@NT@N downloaded from [25]. Meanwhile miRBase V5.0 [40] from the Bioconductor package RmiR.Hs.miRNA 2.11 is used to build the putative target for miRNAs. The detailed results of all target information are shown in the Additional file 2.

Step (2): Bayesian networks structure learning and integration

Bayesian network structure learning has been widely used for discovering gene-gene interaction networks [41], but not often for the discoveries of the interactions between regulators and their target genes. To represent the interactions between regulators (miRNAs and TFs) and between the regulators and their target genes in a Bayesian network, regulators and target genes are denoted by nodes and regulatory interactions are denoted by directed edges. When the expression data of regulators and target genes are given, we can use Bayesian network structure learning to discover the interactions. To start the learning process, we use the target information of regulators to initialise the search space. Therefore in this step, we take the expression profiles and target information as the input to produce a network that indicates the interactions between miRNA-TF, miRNA-mRNA, TF-miRNA, TF-TF, and TF-mRNA. The following four sub-steps are carried to obtain the network.

Step (2.1): Sample splitting

To explore all possible interactions including up-, down-, and mix- regulations (up-regulation in one condition and down-regulation in the other) in different biological conditions, in [42] we developed the “splitting and averaging” strategy for Bayesian networks structure learning. This strategy splits samples in a data set by their categories of biological conditions. Bayesian network structure learning is used to learn the networks from the samples of each condition respectively, and these networks are then integrated or averaged into a single network. In our current problem, we firstly use this strategy to learn the networks for the epithelial and mesenchymal conditions separately, then obtain the integrated network from the networks learnt under the two conditions. So in this sub-step, we split each of the three expression datasets according to sample conditions, 11 samples in epithelial and 36 samples in mesenchymal (conditions 1 and 2 in Figure 1 respectively).

Step (2.2): Creating the initial structure

To learn a Bayesian network with n variables or nodes, the search space, if not restricted, will be all the possible networks formed with the variables. It has been shown that finding the best network from all the networks is NP-hard [43]. Therefore in this paper, we assume that the regulatory relationships between regulators and their target genes form a bipartite graph. This assumption reduces the search space significantly. Moreover, we use target information to initialise the network structure and the topology of the network structure is further constrained. Therefore, we are able to discover the optimal solutions using the exhaustive search method in the given search space. Graphically, the target information can be represented using bipartites as illustrated in Figure 1. There are 5 types of such bipartites or sub-structures, corresponding to our initial knowledge of the interactions of miRNA-TF, miRNA-mRNA, TF-TF, TF-miRNA and TF-mRNA. These bipartites are used as the initial structure for the Bayesian network learning.

Step (2.3): Bayesian network learning process

Each interaction in the initial structure is evaluated based on the expression data, and the high-confidence interactions are retained. The learning process searches through all possible candidate structures and evaluates the interactions with a Bayesian scoring function. The candidate structures are generated by removing edges from the initial structure one by one. The scoring function measures the degree of fitness of any candidate structure G to the dataset. The goal is to select the structure that best fits the data. In other words, we need to calculate the probability of each candidate structure G given the data D, P(G|D). According to Bayes’ theorem, we have:

P ( G | D ) = P ( G ) P ( D | G ) P ( D )

In the above formula, the term P(D) is the same for all candidate structures. Regarding the term P(G), it is quite common to assume a uniform distribution [44], and thus it is a constant. Therefore, for comparative purposes, it is sufficient to calculate only P(D|G) for the scoring function. In this paper, we use the BDe (Bayesian metric with Dirichlet priors and equivalent) scoring function developed by Heckerman et al. [45] in the following.

Score B ( D , G ) = P ( D | G ) = i = 1 n j = 1 q i ( G ) Γ ( N ij ( G ) ) Γ ( N ij ( G ) + M ij ( G ) ) × k = 1 r i Γ ( a ijk ( G ) + s ijk ( G ) ) Γ ( a ijk ( G ) )

where:

n is the number of variables (nodes), X1,X2,…,X n ,

q i is the number of different instantiations of the parent of a variable X i in G,

r i is the number of possible values of X i in G,

a ijk ( G )

are the hyperparameters for the Dirichlet prior distributions of the parameters given the network structure.

s ijk ( G )

are the corresponding observations from data,

N ij ( G ) = k a ijk ( G )

, and M ijk ( G ) = k s ijk ( G ) .

More details of the Bayesian scoring function can be found in [45, 46]. In practice, we use the Bayes Net toolbox for Matlab (BNT) [47], and the BDe scoring function is included in BNT. In the next step (Step (2.4)) we evaluate the confidence levels of the interactions output from the Bayesian network structure learning.

For illustration purpose, consider the learning procedure for the interaction between a regulator R1 and its target gene G1. Assume that in total R1 has two targets and the corresponding initial structure is in Figure 2a. The interactions in each of the four possible candidate structures (see Figure 2b) can be evaluated with the scoring function based on expression data. The scores will decide if there is no edge between R1 and G1 or an edge between them. In this example, there are two structures with an edge between R1 and G1, and two structures with no edge between them. The average score in each of the two cases is calculated and the structure with higher average score ((-4.6-6.2)/2=-5.4) is chosen (Figure 2c).

Figure 2
figure 2

An example of Bayesian network structure learning. (a) The initial structure corresponding to regulator R1 and target gene G1. This initial structure is created based on target information of the regulator. (b) All possible candidate structures generated from the initial structure. The interaction between R1 and G1 in each candidate structure is evaluated using a scoring function. (c) The presence or absence of an edge between R1 and G1 based on the average score of each case. The candidate structure with higher average score will be chosen.

Step (2.4): Integrating and bootstrapping

It is common to have small number of samples in the dataset of a typical microarray experiment, which unfortunately cannot support statistically significant discoveries. To overcome this problem, bootstrapping [48] is usually used to improve the confidence of discoveries. Especially, in Bayesian network structure learning, bootstrapping can be combined with model (structure) average procedure to discover the interactions with high confidence. In this paper, the averaging procedure is firstly applied to the Bayesian network learning process across different candidate structures. This procedure is then applied to the sample splitting step across different sample conditions to calculate the average score for each interaction. Next, the score of each interaction is averaged over the number of bootstrapping, and the confidence levels are estimated based on a statistical model as illustrated in the next paragraph. The interactions with high confidence (p-value <0.05) are selected to form the integrated network (called global network in the paper)

Consider again the example about learning the interaction between R1 and G1. Let n be the number of bootstrapping iterations, q c be the event of learning the interaction on the local data set D c of the cth condition (c∈{1,…,C}). As there are only two possible cases of interactions between R1 and G1, we approximate the whole learning process of the interaction between R1 and G1 as a Bernoulli experiment. We denote q c =1 when there is an edge between R1 and G1 (otherwise q c =0), and assume that p(q c =1)=p(q c =0)=0.5. q c follows a binomial distribution q c B(n,p), as the samples drawn with replacement in the bootstrap are independent [49]. At the integration stage by averaging, the interactions from local data sets D c are aggregated, and the interactions of the regulator R1 and its target G1 learnt through multiple data sets for the C different conditions (denoted as Q R 1 G 1 = c q c ) also follows a binomial distribution Q R 1 G 1 B(Cn,p). Adopting this statistical model, we are able to extract the learnt interactions at significant levels. The interaction that has the confidence level greater than the threshold will be included into the integrated global network. The Matlab codes for the whole process is available on request, and the results for the integrated global network is in Additional file 3.

Step (3): Network inference

A challenging problem of Bayesian network learning is the complexity of the resulting networks. Bayesian network learning usually produces a large number of interactions, including false discoveries. In this step, we extract from the global network learnt in the previous step the interplay between miRNAs and TFs. We search the learnt global network for all of the interactions between miRNAs and TFs. The resulting interplay network will help elucidate the complex regulatory mechanism in specific biological conditions.

In addition to discovering the interplay between miRNAs and TFs, we use the network motif finding algorithm, Cyclus3D [50], to discover the network motifs that involve at least 2 regulators. Network motifs are patterns of sub-graphs that recur at frequencies much higher than those found in randomised networks [2]. The randomised networks satisfy the following criteria: 1) they have the same number of nodes as in the real network, 2) each node in a randomised network has the same number of incoming and outgoing edges as the corresponding node has in the real network, 3) the randomised networks used to calculate the significance of n-node subgraphs were generated to preserve the same number of appearances of all (n−1)-node subgraphs as in the real network. These criteria ensure the randomised networks have the similar structure to the real network, and ensure that a high-significance pattern is assigned not because it has a highly significant sub-pattern [2].

The resulting motifs can be considered as simple building blocks from which the network is composed [51], and are believed to have specific functions which play critical roles in biological network inference [52]. The results presented in the next section show that this network inferences step can produce a set of interactions and molecules which are highly relevant to the biological condition of the EMT datasets.

Results

The output of the method are two types of networks: 1) the interplays between miRNAs and TFs, with their details shown in Figure 3 and Figure 4; 2) the results of network motif finding, which are the Feed Forward Loops (FFLs) that involve at least two regulators (see Figure 5).

Figure 3
figure 3

The interplays between TFs and miRNAs - miRNAs regulate TFs. miRNAs are coloured in red and TFs are in green. The confirmed edges are highlighted with bold lines. All of the nodes in the confirmed interactions are EMT bio-markers. They are the miR-200 family, ZEB1, ZEB2, and SNAI2 (SLUG). The miR-200 family that regulates ZEB1, ZEB2 for EMT has been confirmed by literature.

Figure 4
figure 4

The interplays between TFs and miRNAs - TFs regulate miRNAs. The confirmed edges are highlighted with bold lines. All of the nodes in the confirmed interactions are EMT bio-markers. SNAI2 (SLUG) regulates miR-200 family to indirectly control ZEB1 and ZEB2 for activating EMT regulation procedure. These interactions have been confirmed by literature.

Figure 5
figure 5

Network motif. Feed Forward Loop network motifs are extracted from the integrated global network. The genes identified in these network motifs are significantly enriched for biological functions of EMT.

From Figures 3, 4, and 5, we can see that the results generated by our method are compact with only a small number of interactions. These interactions have been shown to be highly relevant to the biological conditions of EMT, and also several EMT bio-markers which have been confirmed by literature are identified by our method. In the rest of this section, firstly we present the interactions and bio-markers that have been confirmed from literature, then we describe the enrichment analysis we have conducted to show the relevance of identified genes to EMT.

Confirmed interactions and bio-markers for EMT

Previous studies [33, 53, 54] have demonstrated that the miR-200 family targets the E-cadherin transcriptional repressors zinc finger E-box binding homeobox 1(ZEB1) and ZEB2 for EMT. These results have confirmed the interactions found using our method (shown in Figure 3), where we see that the hsa-miR-200 family (miR-200a, miR-200b, miR-200c, miR-429) regulates both ZEB1 and ZEB2. These interactions are the important interactions that are involved in the process of inhibition and induction of EMT. Figure 6 shows the process in detail where genes identified by our method are marked with red bars.

Figure 6
figure 6

The pathway of development_microRNA-dependent inhibition of EMT. Genes identified by our method are marked with red bars. miR-200 family regulates ZEB1 and ZEB2 in the process of inhibition and induction of EMT. These interactions are also identified by our method.

Apart from the miR-200 family, several important transcription factors that act as the bio-markers for EMT are also confirmed by our method. The two transcription factors, ZEB1 and ZEB2, which are regulated by the miR-200 family, are the markers in all three subtypes of EMT [55]. Another transcription factor that plays a crucial role in EMT is SNAI2 (SLUG). In fact, all known EMT events during development, cancer, and fibrosis appear to be associated with SNAI activation [56]. Our results suggest that SNAI2 indirectly regulates ZEB1 and ZEB2 by regulating the miR-200 family transcript (Figure 4), and in turn the miR-200 family regulates ZEB1 and ZEB2 (see Figure 3). This result is consistent with the literature as SNAI2 is confirmed to regulate the miR-200 family [57]. The other EMT bio-marker identified by our method is ETS1 (see Figure 4). It has been suggested that ETS1 is an upstream regulator of ZEB1 and ZEB2 [58], and therefore plays a critical role in activating the regulation of EMT.

Functions of identified genes being highly enriched for EMT

The functions of the identified genes (in Figures 3, 4, and 5) and the pathways which the genes potentially constitute are analysed using GeneGo Metacore from GeneGo Inc. and the Ingenuity Pathway Analysis (IPA, Ingenuity Systems, www.ingenuity.com). The genes identified as a result of the network inference step are significantly enriched for several biological functions. The top functions output from IPA that are known to be critical for EMT are gene expression, cellular development, cellular growth and proliferation, cellular movement, and cell death. Moreover, several genes belong to the classes of invasion and migration. These classes are sub-categories of cellular movement, and they have been confirmed as the functional markers of EMT [59]. This suggests that many target genes and their interactive regulators are involved in EMT. Table 1 shows the genes in the class of invasion and migration together with their p-values.

Table 1 Identified genes are significantly involved in the functional markers of EMT

In addition, the pathways which the genes in our results potentially constitute are identified using GeneGo Metacore. The statistically mapped pathways show that they are highly relevant to EMT. There are direct pathways that regulate EMT, such as the pathway of development_microRNA-dependent inhibition of EMT. This pathway shows the regulation of the miR-200 family and other miRNAs on the EMT bio-markers ZEB1 and ZEB2, and results in the inhibition and induction of EMT. Figure 6 shows the details of this pathway. Other direct pathways such as the development_TGF-beta-dependent induction of EMT via SMADs, and cell adhesion_tight junctions, are known to play critical roles in the regulatory procedure of EMT. The summary of these pathways and the corresponding p-values are given in Table 2.

Table 2 The statistically mapped pathways for EMT involve identified genes

Discussion

During the past few decades, considerable efforts have been made to explore the transcriptional regulatory networks in which transcription factors play the role as a main regulator. Other recent studies have investigated the post-transcriptional regulatory networks with miRNAs as the main regulator. However, with the ultimate goal of achieving a profound understanding of the mechanisms that control gene activities, it is sensible and desirable to find regulatory relationships involving both types of regulators from diverse sources of data.

In this paper, we utilise Bayesian network learning in constructing the network, but the integrated global network in general is not a Bayesian network. For instance, one of the requirements for Bayesian networks is that the network structure must be a Directed Acyclic Graph (DAG). Our integrated global network may include some loops of interactions where two regulators interact with each other, hence it is not a Bayesian network. Such cyclic network behaviour is more reasonable in reality, as more and more feedback loops between miRNAs and TFs are being reported. For instance, the ZEB/miR-200 pair is a feedback loop that regulates EMT [60]. Therefore, the integrated global network may provide more information than normal Bayesian networks which are DAGs.

In the network inference step, we use network motif finding algorithm to discover the sub-networks that recur at statistically significant level. Interestingly, the results from these small sub-networks still retain several important interactions and molecules relevant to the biological condition of the dataset. The relationships between the significance in frequency of graphs and biological functions are still open and interesting research topics. In the dataset used in this paper, the results are supportive for this hypothesis. An advantage of motif finding is that it produces a manageable number of interactions that can be used for further experimentation. The results from this paper, therefore, can provide good resources for future validating experiments.

While the network motifs found based on the regulatory network may provide useful patterns to guide biological experiments, these motifs depend on the structure of the regulatory network. The structure of the regulatory network in this paper is obtained based on the assumption that miRNAs and TFs are regulators and mRNAs are targets. However, the knowledge of gene regulatory relationships is still limited and the assumption may not always hold in reality. When the structure of the regulatory network changes the resulting network motifs may change too.

In the paper, we use the differentially expressed genes as the nodes for the gene regulatory network. We assume that genes whose expression levels do not change significantly between conditions would not play an important role in the regulatory network. There may be the case that a gene is the target of two regulators that cancel out each other, resulting in the expression level of the target gene unchanged. However, to make our method computationally practical we do not consider such cases.

To start the process of Bayesian network structure learning, target information is used to initialise the network. The target information based on sequence data may involve false discoveries. Bayesian network structure learning uses gene expression data to evaluate the confidence level of each interaction (edge) in the initial network, and only the interactions of high confidence are integrated into the global network. Therefore, graphically the set of edges in the global network is a subset of the set of edges in the initial network. The enrichment analysis shows that the important interactions for EMT are retained in the global network, demonstrating the effectiveness of the method. Other high-confidence interactions provide strong hypotheses for experimental validations.

Conclusions

In this study, we have proposed a framework for inferring complex gene regulatory networks using diverse sources of data, including target information for regulators, expression profiles, and sample categories. The interplay between regulators and the motifs with which they regulate target genes are revealed in the three-component network, and it is impossible to infer the interplay from any single regulator regulatory networks. The analysis of the EMT datasets has produced several results that have been validated by literature, a number of statistically significant interactions between miRNAs and TFs, and novel network motifs.