Background

Seasonal and pandemic influenza A virus (IAV) continues to be a public health threat and to exert a large economic burden worldwide [1]. IAV RNA segments that encode the hemagglutinin (HA) and neuraminidase (NA) proteins can undergo mutation (antigenic drift) or reassortment (antigenic shift), resulting in new viral strains that humans may lack the heterologous immunity against (e.g., the pandemic H1N1 2009 [2]). In such circumstances, the cell-mediated and humoral immune responses are primary, and a better understanding of the molecular mechanisms of immune responses to IAV infection thus becomes necessary to the development of more effective prevention and treatment strategies. Particularly, the regulations of cell gene expression are of significant importance due to their critical roles in shaping the magnitude and timing of immune responses via the promotion or suppression of the production of various proteins and RNAs (e.g., cytokines [3, 4] and miRNAs [5]). Since different cell populations with distinct phenotypes and functions are expected to have different gene regulation profiles, it is desirable to characterize the gene regulations for each cell population as in [6, 7]. In this study, the respiratory epithelial cells are of primary interest because they are the first barrier and the main target of IAV infection in human [8]. A number of previous studies have measured and analyzed the gene expression profiles in respiratory epithelial cells during IAV infection [912]; however, only a few studies investigated the gene expression regulators like transcription factors [13] or miRNAs [14]. Several studies in other biological disciplines show that TFs and miRNAs cooperatively interact with each other and both of them should be considered in the regulatory network model [15, 16]. To our best knowledge, simultaneous identification of both the transcriptional (TF) and post-transcriptional (miRNA) regulations on a genome-wide scale in human respiratory epithelial cells post IAV infection has not been sufficiently addressed.

A number of systems biology approaches have been developed to understand the transcriptional or post-transcriptional regulations [1719], including several studies particularly for influenza H1N1 virus infection [13, 20, 21]. For example, Butte and Kohane [22] considered the pair-wise mutual information to screen the gene data associations. Friedman et al. [23] employed the Bayesian network model to capture the conditional independencies between genes. Yeung et al. [24] used the singular value decomposition (SVD) and the robust regression for network reverse engineering, given the sparsity of real large-scale networks. Zhang and Horvath [25] proposed the weighted gene co-expression network analysis, which assigns a continuous connection weight between 0 and 1 to each gene pair. Basso et al. [6] combined the mutual information with the data processing inequality theory to analyze the gene expression profiles in human B cells. Ordinary differential equation models have also been considered in the previous studies [26], in which the computing efficiency issues were addressed for high-dimensional dynamic systems. However, as evaluated in [27, 28], the accuracy of these existing methods are not satisfying, and the reliability of network reverse engineering from gene expression data needs to be further improved. More importantly, two key issues are not sufficiently addressed in the previous studies: first, while both TFs and miRNAs are the major regulators in controlling gene expression [19, 29], none of the methods mentioned above simultaneously considered both transcriptional and post-transcriptional regulations such that the inferred interactions could be biased; second, most of the existing approaches are purely data-driven and thus the results could have a high false positive rate [3032].

In this study, we propose a new framework called SITPR, which stands for Systematic Identification of Transcriptional and Post-transcriptional Regulations, to identify the regulatory relationships by exploiting both curated knowledge and time-course expression data of mRNAs and miRNAs. For this purpose, a background regulatory network is constructed on a genome-wide scale by collecting the experimentally-observed interactions in literature or public databases [33, 34] as well as the potential regulatory interactions predicted by representative algorithms for quantifying the sequence-selective binding feasibility between TFs/miRNAs and genes [18, 35, 36]. Since this background network is not cell type or disease specific, only part of the interactions in this background will be activated in respiratory epithelial cells in response to influenza infection. To identify these activated regulatory interactions, we first employ several dimension reduction approaches, including the community detection algorithm based on network modularity [37], the time delay detection method [38, 39], the functional principal component analysis (fPCA) for screening differentially expressed genes [40], and the maximum information coefficient (MIC) [41]. After dividing the large network into much smaller modules, we propose the constrained LASSO method and combine it with the dynamical Bayesian network (DBN) model to detect the activated regulatory relationships from time-course expression data within each module. Our simulation studies suggest that the proposed framework can achieve satisfying performance in terms of sensitivity and specificity for networks of different sizes. We then apply the framework to the real experiment data and analyze the topological and functional features of the regulatory network in respiratory epithelial cells in response to H1N1 influenza virus infection.

Results and discussion

Background regulatory network obtained for human

Instead of inferring regulatory network only from expression profiling data, we first build a background regulatory network for human (see Additional file 1: Text S1 and Methods), and then identify the activated regulatory relationships during IAV infection using the mRNA and miRNA time-course expression datasets. We collect the documented regulatory relationships among TFs, miRNAs, and genes in various databases and literature, and also incorporate the potential regulatory relationships between regulators and targets predicted according to the sequence motifs of TFBSs (see Methods). For instance, we retrieve information from the ENCODE project [42] (Text S1) so the majority of the regulators and targets in ENCODE (102 out of 119 TFs, 723 out of 736 miRNAs and 13,607 out of 15,131 target genes) are included in our study. The resulted background regulatory network for human contains 23,079 nodes and 369,277 edges, consisting of 1,456 TFs, 1,904 miRNAs and 19,719 target genes. Among all these edges, 55.6% are experimentally validated interactions, and the rest are predicted regulatory relationships. Note that the regulatory interactions in the constructed background network are not limited to particular cell types or disease conditions; therefore, only part of these interactions will be activated under specific conditions (e.g., in respiratory epithelial cells infected by IAV).

For verification purpose, we calculate several statistical measures of the background regulatory network (Additional file 1: Table TS4), which clearly suggest that the obtained background network is different from a random network. More specifically, the clustering coefficient of our background network is 0.117, which is much higher than that of a random network of a comparable size (~ 1.5 × 10- 5) [43]. Also, the node degrees of our background network are found to satisfy the power-law distribution (Additional file 1: Figure TS3). Fitting the power law y = α ⋅ x- γ, where y denotes the number of nodes and x denotes the node degree, we obtain γ = 2.126, which is between 2 and 3 and thus suggests our background network is scale-free. This value also indicates that our background network is not random [43, 44].

The constructed background network is a high-dimensional network. To make the computing cost affordable, we employ several topology-based and data-based dimension reduction methods to divide the background network into smaller modules. We then identify the activated interactions within each module by fitting the dynamic Bayesian network model to the time-course expression data. To control the false positive rate, we also introduce the novel constrained LASSO formulation into the model fitting procedure.

Algorithm performance evaluation and comparison

Before we apply the SITPR framework to real data, the algorithm performance should be evaluated and compared using networks of different sizes. More specifically, the network size in terms of total node number is 10, 50 or 100, which is chosen to be comparable to the node numbers of the real network modules obtained in this study (ranging from 2 to ~200). We use linear ordinary differential equations (ODEs) to match the structure of a given network and generate the simulated time-course expression data. To be consistent with the real data used in this study, simulated data are generated at six time points (t = 0, …, 5) with six replicates at each time point. Gaussian white noises with a standard deviation of 10% of the data mean are added to all the data points. Six commonly-used criteria are chosen to evaluate the algorithm performance, including sensitivity (SN), specificity (SP), accuracy (ACC), F-measure, Matthews correlation coefficient (MCC), and the Area Under ROC Curve (AUC),

S N = T P T P + F N , S P = T N T N + F P , A C C = T P + F N T P + F P + T N + F N , F - measure = 2 × S N × S P S N + S P , M C C = T P × T N - F P × F N T P + F N T P + F P T N + F P T N + F N ,
(1)

where TP, FN, FP, and TN are the numbers of true positive, false negative, false positive, and true negative, respectively; also, AUC does not have a closed-form expression and is thus not given in Eq. (1).

For illustration purpose, we show a toy example of the regulatory system that consists of only 10 genes, which has a background regulatory network structure (in gray) as in Figure 1(A) and a set of activated regulatory relationships (in color) as in Figure 1(B). Furthermore, the miRNA is the G3 node labeled in magenta, and the true regulation coefficients are explicitly labeled next to each edge in Figure 1(B). The inferred regulatory relationships and coefficients from a randomly-chosen simulated dataset are shown in Figure 1(C), and the corresponding evaluation criterion values together with the ROC curve are plotted in Figure 1(D). Based on 10 simulation runs, we calculate the six criteria and the associated standard deviations. For this toy case, the proposed framework achieves a SN of 0.886 ± 0.090, SP of 0.971 ± 0.069, F-measure of 0.925 ± 0.070, MCC of 0.875 ± 0.144, ACC of 0.943 ± 0.667, and AUC of 0.966 ± 0.069 (also see Table 1).

Figure 1
figure 1

Illustration of the performance evaluation of SITPR using an example regulatory network. (A) The background regulatory network with 10 nodes. 'G3’ denotes a miRNA and is labeled in magenta. (B) The activated regulatory relationships (edges in color). The numbers next to the edges are the regulatory strengths. (C) An example of the inferred activated regulatory network using SITPR. (D) The ROC curve and the six performance evaluations at the maximum F-measure point.

Table 1 Performance evaluation and comparison of SITPR, PCC, MI, CLR, ARACNE, GENIE3 and TIGRESS for networks of three different sizes

It is necessary to further evaluate the SITPR performance for background regulatory networks of a higher dimension. Since the number of nodes in the largest modules obtained by dividing the real background network (see Methods) is on the order of 100, we consider networks with 50 and 100 nodes. For this purpose, we adopt two network structures from the DREAM3 challenge [28] so both of them have the small-world property and an exponential degree distribution. The smaller structure has 50 nodes and 62 activated regulatory relationships, and the larger structure has 100 nodes and 125 activated regulatory relationships (see Additional file 2, in which 1 denotes a positive regulation and -1 the opposite). We randomly add more inactive edges to the two networks (denoted by 0 in Additional file 2: Table S1) so the smaller background network has 62 out of 82 edges are activated and the larger network has 125 out of 166 edges are activated. Based on such network structures, we use the linear ODE model again to generate simulated data at six time points with six replicates at each time point, and add 10% Gaussian white noise to each data point.

It is also necessary to compare the proposed framework with other representative or state-of-the-art reverse engineering approaches. For this purpose, we consider the Pearson’s correlation coefficient (PCC) method, the mutual information (MI) method [27, 28], and the best four algorithms in the DREAM challenge, including CLR [45], ARACNE [46], GENIE3 [47] and TIGRESS [48]. For fairness of comparison, the same task is assigned to all the methods under comparison: identify the activated regulatory relationships from a given background network. The parameter settings of the other methods under comparison are from the corresponding literature, and we make necessary modifications for these methods to take a background network as input.

In Table 1, we summarize the performance comparison results for three different network sizes (10, 50, and 100 nodes, respectively) based on 10 simulation runs. For the 10-node networks, the SITPR approach has a remarkable performance in comparison with the other six methods. For example, the AUC of SITPR is 0.966 ± 0.069, while the PCC, MI, CLR, ARACNE, GENIE3 and TIGRESS methods only achieve an AUC of 0.596 ± 0.082, 0.567 ± 0.062, 0.540 ± 0.059, 0.577 ± 0.084, 0.609 ± 0.057, 0.586 ± 0.122, respectively. For the 50-node network, the performance of the SITPR framework is again notably superior to those of the other methods, evidenced by the fact that all the six evaluation criteria of SITPR are the largest among all the approaches under comparison. For instance, the AUC of SITPR is 0.703 ± 0.037, while the MI and PCC methods have the second and the third best AUCs of 0.574 ± 0.027 and 0.557 ± 0.038, respectively. As the number of network nodes increase to 100, the performances of all these approaches decrease; however, the SITPR method still outperforms all the other methods in terms of, e.g., AUCs.

To assess the robustness of the SITPR method against data noise, we generate 100 simulated datasets for each of the three noise levels: let the standard deviation of the Gaussian white noise be 10%, 20% or 30% of the data mean. As suggested in Table 2, the performance of the SITPR method only marginally deceases as the noise level increases, so we conclude that its performance is robust against data noise. For example, for the three different noise levels, the AUCs of our method are 0.986 ± 0.045, 0.924 ± 0.133, and 0.889 ± 0.158, respectively.

Table 2 Effects of the noise level and the total number of background edges on the performance of SITPR based on 100 simulation runs

We further evaluate the effects of the total number of edges in the background network on the SITPR performance. In Figure 1, we consider a background regulatory network with 14 activated interactions out of a total of 21 background regulatory relationships. Here we increase the number of background regulatory relationships to 40, 80 and 100, while the number of activated interactions remains at 14. As shown in Table 2, when the number of background regulatory relationships increases, the performance of our approach deceases marginally. Specifically, the AUCs of the SITPR method are 0.940 ± 0.060, 0.893 ± 0.095 and 0.811 ± 0.071 for 40, 80 and 100 background regulatory relationships, respectively.

In summary, the simulation study results provide direct evidences for the effectiveness and robustness of the proposed framework for systematically identifying activated transcriptional and post-transcriptional regulations from a given background network.

Regulations in respiratory epithelial cells during IAV infection

We apply the SITPR framework to understand both the transcriptional and post-transcriptional regulations that are activated in human respiratory epithelial cells (A549 cell line) in response to H1N1 influenza virus infection. In short, from the constructed background regulatory network at the genome-wide scale, we obtain 303 functional modules (2 to ~200 nodes and 1 to ~300 edges in a module) after dimension reduction, and then identify the activated transcriptional and post-transcriptional regulations simultaneously in each module using the dynamic Bayesian network model together with the constrained LASSO formulation (see Methods).

Network motifs are known as the functional building blocks that control gene expressions [15, 49]. Therefore, after we determine the activated regulatory interactions from the background network, we further identify 10 types of 'TF-miRNA-gene’ co-regulation motifs using the FANDOM algorithm [50]. For convenience, the results are summarized in Table 3, in which the numbers of the 10 types of motifs in the background network and in the activated network are both listed. The detailed motif structures can be found in Additional file 1: Figure TS5. The statistical significance of each motif is also evaluated by calculating the associated Z-score. For this purpose, 1000 random networks are generated, and the Z-score is calculated as the difference between the motif occurrence in the real network and its average occurrence in the random networks, divided by the standard deviation of the occurrences in the random networks. We perform such calculations for both the background regulatory network and the activated regulatory network; and a motif with a Z-score greater than 2 is referred as enriched and statistically significant. For respiratory epithelial cells in response to H1N1 influenza virus infection, the ten type of motifs in the background regulatory network after data-based dimension reduction (see Methods) involve 20,310 regulatory relationships among 621 TFs, 642 miRNAs and 7,356 target genes, while nine types of these motifs can be found in the activated network, which involve 4,774 regulatory relationships among 420 TFs, 431 miRNAs and 3,773 target genes (Additional file 3). In these motifs, the numbers of the transcriptional and post-transcriptional regulations are of the same order of magnitude (e.g., 3,140 'TF-gene’ versus 1,449 'miRNA-gene’ regulations), which suggests that the importance of the post-transcriptional regulatory interactions may be comparable to that of the transcriptional regulatory relationships.

Table 3 Ten co-regulation motifs and their occurrence frequencies in the background network and the activated network

The motifs in the activated regulatory network (Table 3) are worth of further investigation to better understand the synergy of these motifs in gene expression regulation. We thus combine these co-regulation motifs into larger components if they are connected or overlap with each other. These components involved 61 TFs, 48 miRNAs and 127 genes (Additional file 4). Limited by space, we plot the expression profiles of a part of these TFs, miRNAs, and target genes in Figures 2(A), (B), and (C), respectively; the structures of a part of the components are sketched in Figure 2(D). A specific example is the component centered at 'CEBPB’ (CCAAT/Enhancer-Binding Protein Beta), which is an important TF involved in immune and inflammatory responses [51]. In this component (see the upper left plot in Figure 2(D)), 'TRIM28’ (tripartite motif containing 28, a co-factor for 'CEBPB’ involved in certain signal transduction pathways of glucocorticoids and inflammatory cytokines [52]) and 'CEBPB’ are found to regulate the expressions of each other, and their expression patterns shown in Figure 2(A) are consistent with this prediction. 'hsa-miR-191’ is identified as the primary miRNA that targets on the 'CEBPB’ mRNAs, through which the expressions of a number of important genes are indirectly co-regulated. More specifically, the 'CCL5’ (chemokine ligand 5) gene, which encodes the RANTES chemokine that plays an active role in leukocyte recruiting and NK cell activation and proliferation (together with IFN-γ) [53], is found to be the co-regulation target. The 'ISG20’ gene (interferon stimulated gene) is involved in the antiviral function of interferon against RNA viruses [51], and 'ABCG1’ (ATP-binding cassette subfamily G member 1) is involved in macrophage cholesterol and phospholipids transport, and is crucial for IAV replications and clearance responses of immune cells [51]. Both 'ISG20’ and 'ABCG1’ are also the target genes co-regulated by 'hsa-miR-191’ and 'CEBPB’. Such results provide further evidence for the importance of co-regulations in immune responses to influenza virus infection.

Figure 2
figure 2

Expression patterns of selected TFs (A), miRNAs (B) and target genes (C) in the combined co-regulation motifs (D). The first motif cluster in (D) contains the motifs 'CEBPB’-'hsa-miR-191’-'CCL5’/'TRIM28’/'ISG20’/'ABCG1’ etc., and the second motif cluster contains 'BRCA1’-'hsa-miR-28-5p’-'TUBB’/'POLR2A’.

To gain a better view of the broad picture, we perform the functional enrichment analysis on each of the 303 modules using DAVID [54]. In Table 4, the enriched GO (gene ontology) biological processes [55] and the documented pathways in KEGG/REACTOME [33, 56] in 25 selected modules are listed for illustration purpose; for the complete analysis results, see Additional file 5. According to Table 4, the 'cell cycle’, 'cell proliferation’, 'positive regulation of programmed cell death’, and 'apoptosis’ are enriched, which suggests an active role of the cell-cycle related functions in response to IAV infection. In addition, a number of pathways (e.g., 'Influenza Infection’, 'Jak-STAT signaling pathway’, 'Wnt signaling pathway’, and 'MAPK signaling pathway’) are enriched in respiratory epithelial cells, with which a number of important immunological functions are associated (e.g., 'inflammatory response’, 'immune response’, 'I-kappaB kinase/NF-kappaB cascade’, 'cell activation during immune response’, and 'leukocyte activation during immune response’). For example, the interferon α/β receptor gene (IFNAR1) as well as the entire 'Jak-STAT signaling pathway’ in Module 106 are found enriched, which is consistent with the mechanism of the antiviral state development of lung epithelial cells conferred by Type I interferons [8, 57].

Table 4 Functional terms of GO biological processes and KEGG/REACTOME pathways enriched in 25 selected regulatory modules (FDR < 0.05)

Figures 2(A) and (B) illustrate the activated regulatory relationships in two example modules (Modules 173 and 179) at a higher level of granularity, corresponding to the two components from the left of the first row sketched in Figure 2(D). For transcriptional regulations in Figure 3, 'CCL5’ is found to be regulated by 'IRF5’ (interferon regulatory factor 5) and 'NFκB2’ (nuclear factor kappa-B P100) besides 'CEBPB’. The induction of 'CCL5’ by 'IRF5’ and the regulatory roles of 'NFκB’ and 'CEBPB’ in 'CCL5’ expression have been reported in previous studies [53, 58, 59] , and are successfully identified by our method. In addition, the chemokine 'IL-8’ is found to be regulated by two TFs, 'NFκB2’ and 'ETV4’ (ETS translocation variant 4) [34, 53]. However, for such many-to-one regulatory relationships, it is unclear which TF is the major regulator and further experiments are thus needed to clarify this issue. Multiple post-transcriptional regulations can also be found in Figure 3. For example, 'hsa-miR-191’ is identified to suppress gene 'REPS1’ (RALBP1 associated Eps domain containing 1) in Figure 3(A), and 'hsa-miR-28-5p’ to suppress gene 'CIT’ (rho-interacting, serine/threonine kinase 21) in Figure 3(B). 'REPS1’ is involved in cytoskeletal signaling pathway and 'CIT’ encodes a serine/threonine protein kinase that functions in cell division and functions to promote efficient cytokinesis [51]. The repression of the two genes by miRNAs suggests a disorder of host factors caused by IAV infection via post-transcriptional regulations. Moreover, 'IFNGR2’ (interferon gamma receptor 2) suppression by 'hsa-miR-644’ and 'IFNG’ (interferon gamma) suppression by 'hsa-miR-26a’ are identified in Modules 107 and 287, respectively. This indicates the importance of miRNAs in innate immune responses against IAV infection by interfering IFN production and signaling [53, 60]. As an example of co-regulations, the 'hsa-miR-28-5p’-'BRCA1’-'TUBB’ regulation motif can be seen in Figure 3(B). 'TUBB’ (tubulin beta class I) is a major constituent of microtubules [51] in the REACTOME 'Cell cycle’ pathway [56], and is also a known human host proteins for IAV replication [21]. 'BRCA1’ is well known as a key mediator for repairing DNA damage by maintaining the genomic integrity [51]. A better understanding of the activated regulation among 'hsa-miR-28-5p’, 'BRCA1’ and 'TUBB’ could be important to designing the miRNA gene therapeutic trials targeting IAV host factors [21].

Figure 3
figure 3

The activated regulatory relationships in two example modules. TFs, miRNAs and genes are in cyan, magenta and green, respectively. The up-regulation and down-regulation are labeled in red and blue, respectively. The background regulatory relationships which are not activated are in gray. The 'arrow’, 'T’ and 'diamond’ shapes of edge terminals denote to up-, down-, and undetermined- regulations, respectively. (A) Module173. This module contains the co-regulation motifs 'CEBPB’-'hsa-miR-191’-'CCL5’ and 'CEBPB’-'hsa-miR-191’-'ALB’/'ISG20’. It also contains some two-node regulatory motifs, e.g., 'ETS2’-'ETS2’, 'NFKB2’-'hsa-miR-1227’, 'IRF5’-'CCL5’ and 'hsa-miR-191’-'REPS1’. (B) Module179. This module contains the regulatory motifs 'BRCA1’-'has-miR-28-5p’-'TUBB’/'POLR2A’, “EGR1’-'hsa-miR-155*’, 'EGR1’-'hsa-miR-146a’-'LTB’ and 'BRCA1’-'hsa-miR-146a’-'PHF1’.

Conclusions

In this study, we proposed a novel framework for systematic identification of transcriptional and post-transcriptional regulations (SITPR) using curated knowledge, sequence-based predictions and time-course expression data. Briefly, a comprehensive background regulatory network, which is not cell type- or disease-specific, was constructed at the genome-wide scale using information from ~20 databases and predictions from representative algorithms. After dividing the background network into smaller modules, we further proposed the constrained LASSO method and applied it with the dynamic Bayesian network model to each of the modules to determine the activated regulatory relationships based on the time-course expression data of mRNAs and miRNAs. The simulation studies clearly suggest the superiority of the proposed framework over other representative approaches in the context of identifying activated regulatory relationships from a given background network. The SITPR framework was then applied to real data to identify the activated regulatory relationships among TFs, miRNAs and target genes in human respiratory epithelial cells in response to H1N1 influenza virus infection.

Different from many existing methods for inferring gene regulatory network that are purely data-driven [17, 27, 28], we incorporated curated knowledge and utilized sequence information besides time-course gene expression data. Including curated knowledge can improve the accuracy of identifying genuine regulatory relationships [30, 32, 61], and the incorporation of predicted potential regulatory interactions among TFs, miRNAs and genes allows to discover novel regulatory relationships between regulators and targets. Furthermore, the way we constructed the background network (e.g., using predictions of sequence binding instead of associations between expression data [42, 62, 63]) and the use of the constrained LASSO help to control the false positive rate in identifying activated regulatory relationships, as suggested in [64, 65]. Also, since we considered the transcriptional and post-transcriptional regulations simultaneously, the identified activated regulatory relationships could be less biased [62, 63, 66, 67]. It should be mentioned that, limited by data availability and algorithm prediction accuracy, our background network built in this study is not thorough; however, the SITPR framework is flexible and allows us to update the background network whenever new knowledge is discovered or better prediction algorithms are devised.

Using the proposed framework, the transcription regulatory landscape was depicted for human respiratory epithelial cells infected by H1N1 influenza A virus. The transcriptional and post-transcriptional regulations were simultaneously identified and analyzed for airway epithelial cells in response to IAV infection. Besides pairwise interactions, we paid particular attention to regulation motifs and their synergistic effects on gene expressions. We identified ten regulation motifs using FANDOM and selectively presented and discussed some important motifs activated during IAV infection, which could suggest potential targets for influenza treatment or vaccination. It should be addressed that the regulation motifs we defined in this study are not only the functional building blocks but also the topological elements in a complex regulatory network. Identification of such motifs can provide the basis for advanced analyses such as differential network biology [68].

The SITPR is also applicable to a pre-defined gene set. For example, Figure 4 illustrates the activated regulatory relationships in the KEGG IAV pathway, in which we included the genes contained in this pathway as well as the neighboring miRNAs in our background network. One can tell from Figure 4 that 'STAT2’ (signal transducer and activator of transcription 2) directly regulates the expressions of 9 different miRNAs, 5 target genes and two other TFs, suggesting a highly influential role of 'STAT2’. In addition, we found that 'STAT2’, 'IRF9’ (interferon regulatory factor 9) and 'hsa-miR-583’ form a co-regulation motif. 'STAT2’, 'hsa-miR-519b-3p’ and 'JAK1’ (Janus kinase 1) also forms another co-regulation motif. 'STAT2’, 'JAK1’ and 'IRF9’ are known to be the key regulators in the 'Jak-STAT signaling pathway’, which plays a critical role in regulating immune responses [33]. The co-regulation motif 'MAPK1’-'hsa-miR-543’-'IL8’ was identified, which may indicate an important role of miRNA 'has-miR-543’ in controlling the production of the chemokine 'IL8’. In short, one obvious benefit of analyzing a pre-defined gene set is that the network dimension is significantly reduced such that SITPR can achieve a better accuracy, as suggested by our simulation studies. Loveday et al. [14], in which the time-course expression data used in this study were generated, also conducted a regulation study on a pre-defined gene set. However, they selected hundreds of differentially expressed genes and only focused on the post-transcriptional regulations between miRNAs and target genes. Also, the regulatory relationships identified in Loveday et al. [14] were primarily based on the baseline and the data on hour 8. Therefore, although there exist a few regulatory relationships identified in both our study and Loveday’s (e.g., 'hsa-miR-328’ and 'PSME3’), the majority of the regulatory interactions identified in the two studies are different due to all the reasons described above.

Figure 4
figure 4

The activated regulatory network in the KEGG IAV gene set. The genes in the IAV pathway together with their neighbors from the background network were analyzed. The importance of miRNAs can be told from the co-regulation motifs, such as 'STAT2’-'hsa-miR-583’-'IRF9’, 'ATF2’-'hsa-miR-374b’-'JAK1’, 'IKBKB’-'hsa-miR-218’-'AKT1’, and 'MAPK1’-'hsa-miR-543’-'IL8’.

In summary, the background regulatory network constructed in this study provides a start point for the identification of condition-specific regulatory interactions. Together with the selected dimension reduction techniques, the SITPR framework is formed to achieve a better accuracy at an affordable computing cost. Also, for human respiratory epithelial cells in response to influenza A virus infection, the results generated using SITPR reveal a number of interesting activated regulatory interactions and provide useful clues for future experimental design and validation.

Methods

Expression data and experiments

The time-course expression data of mRNA and miRNA in human A549 cells infected with influenza H1N1 virus (A/Mexico/InDRE4487/2009) at a multiplicity of infection (MOI) of 0.1 were downloaded from NCBI GEO [69] (GSE36553 and GSE36461). The samples of mRNA and miRNA were hybridized on Illumina HumanHT-12 v3 BeadChips and Febit miRBase 14 Geniom miRNA Biochip, respectively. Data at six time points (0, 4, 8, 24, 48, and 72 hours post infection) were collected with six replicates at each time point [14]. However, one chip sample on hour 4 was found to be degraded [14], so only 35 samples were usable for analysis. More details of the experiment and raw data preprocessing can be found in the original manuscript [14]. We mapped the probeset IDs to NCBI official gene symbols using the GEO annotation file. When multiple probesets were mapped to the same gene, the probeset with the maximum inter quartile expression range was selected.

Construction of the background regulatory network

TFs and miRNAs are the major transcriptional and post-transcriptional regulators, respectively [18, 19]. Simultaneously considering the interplays among TFs, miRNAs and target genes could result in a more accurate identification of regulatory relationships [62]. Therefore, five types of regulatory relationships need to be accommodated in the background network as illustrated in Figure 5(A): 'TF-gene’, 'miRNA-gene’, 'TF-miRNA’, 'miRNA-TF’ and 'TF-TF’ self-regulation. For this purpose, we collected a large number of documented regulatory relationships from about 20 public databases and predicted the potential regulations among TFs, miRNAs and target genes by scanning the sequence binding motifs. It should be addressed that the background regulatory network is not cell type or disease specific since it is constructed at the genome-wide scale. Also, the background network allows the discovery of novel regulations by including potential regulations. Instead of associations in gene expression data, the potential regulations are predicted by matching the sequence binding motifs of targets, which is helpful to reduce the false positive rate (FPR) in the inferred regulations [70].

Figure 5
figure 5

Illustration of regulatory relationships and the dynamic Bayesian network (DBN) model. (A) Five types of regulatory relationships among TF, miRNA and target gene. (B) Example of a DBN model for a 3-node network.

Specifically, human TFs were compiled from FANTOM [71], TRANSFAC [35] and JASPAR [36]; human miRNAs were downloaded from miRBase [72]; and human genes and annotations were from NCBI GenBank and RefSeq [73]. For convenience, all the human TFs and genes were denoted using their NCBI official gene symbols. For the 'TF-gene’ and 'TF-TF’ regulations, we downloaded the curated regulations from TRED [34] and KEGG [33], and the regulatory DNA elements of transcriptional factor binding sites (TFBS) were extracted from TRANSFAC [35] and JASPAR [36]. For binding motifs in TRANSFAC, we identified the putative binding regions of TFs in human genome using the tables tfbsConsSites and tfbsConsFactors from UCSC Genome Browser [74], as suggested by the ENCODE project [42]. When mapping to the human genome (UCSC h19 human genome assembly), the binding regions were searched within the promoter region from 5-kb upstream to 1-kb downstream of the transcription start site (TSS) annotated by RefSeq. We used the default Z-score of 2.33 to control the type I error in binding region prediction. For JASPAR TFBS annotation, we used the MotifFeatures and AnnotatedFeatures tables from Ensembl [75] and extracted the TFs and their binding sites regions in the human genome. Again, we mapped the binding sites to human genome and identified the regulatory relationships between TFs and their potential targets. For completeness, we also extracted the TF and their target proteins from human protein-protein interaction data in HPRD [76] and KEGG [33], and mapped such interactions to the 'TF-gene’ and 'TF-TF’ regulations, which allows a more thorough and systematic exploration of the regulatory interactions as suggested in previous studies [62]. For the 'TF-miRNA’ regulations, the previously-reported TF-miRNA interactions were downloaded from TransmiR [77]. We used a procedure similar to that for the 'TF-gene’ and 'TF-TF’ regulations to identify the regulations between TFs and miRNA’s corresponding genes. For the 'miRNA-gene’ and 'miRNA-TF’ regulations, the curated miRNA-gene interactions are available in public databases such as Tarbase [78], miRecords [79] and miRTarBase [80]. There are also numerous computing methods for predicting miRNA targets, such as TargetScan [18] , miRanda [81], PicTar [82], microT [83] and MicroCosm [72]. We downloaded the experimentally-confirmed miRNA-gene interactions from the three pubic databases, and the potential interactions were kept only if they were predicted by at least two of the five algorithms mentioned above.

Finally, we built a background regulatory network for human with 23,079 nodes and 369,277 edges, consisting of 1,456 TFs, 1,904 miRNAs and 19,719 target genes. The statistics and more details of the background network are available in Additional file 1. For the real data we used in the Results Section, 37,586 mRNAs and 904 miRNAs were measured in the experiments. The overlap between the background network and the experiment data contains 18,964 nodes (1,441 TFs, 881 miRNAs, and 16,642 target genes) and 335,963 interactions. All the data and codes can be accessed at our website http://doc.aporc.org/wiki/SITPR.

Dimension reduction

Although the number of edges in the background network is significantly smaller than that of a fully connected network, the computing cost is still prohibitively high if we directly use the entire background to infer the activated regulatory relationships. We thus considered several data-based and topological feature based methods for dimension reduction.

First, we only kept the differentially expressed genes during viral infection, which were identified at a significance level of 0.05 using our functional PCA approach particularly designed for time-course data [40]. Second, we filtered out the regulatory pairs if the initial change in the regulator gene expression lags behind that in the target gene expression [38]. For this purpose, we clustered the time-course expression data of mRNAs and miRNAs using STEM [39] and then determined the time of initial change in gene expression for each cluster. That is, within each cluster, the data at the same time point are independent replicates as suggested by the experiment design [14]; therefore, the two-sample t-test with Bonferroni correction was conducted to compare the expression levels at t = 1, …, K (K = 5) with that at t = 0, and the first time point with a p-value less than 0.01 was deemed as the initial change time point. Obviously, the target genes as well as TFs in the same cluster will have the same time of initial change. Then if the initial change time of a TF is later than that of its potential target genes, the corresponding TF-gene interactions are filtered out [38]. Third, we filtered out the regulatory pairs that have no any association in their expression data. For this purpose, we calculated the MIC scores (and the associated p-values) [41], which can measure both linear and nonlinear associations. The associations that have a p-value less than 0.05 were kept, and the MIC scores were assigned to the corresponding regulatory interactions as the edge weights for later use.

For further dimensional reduction, the weighted background regulatory network, which was generated at the third step in the previous paragraph, was divided into multiple smaller modules using the fast community detection algorithm [37] such that the connections within modules are much denser than those between modules [84]. That is, the nodes within each module are tightly connected by many edges, while the nodes in different modules are connected by a smaller number of edges. Therefore, the generated modules can be deemed as the functional blocks of the regularity network [85, 86]. We obtained 303 modules with the number of nodes in these modules ranging from 2 to ~200 and the number of edges ranging from 1 to ~300. Finally, it should be mentioned that the edge weights are not used in any further analysis.

DBN and constrained LASSO

Since miRNAs repress the translation from mRNAs to proteins and TFs may regulate themselves, a directed graph model with cycles is necessarily needed to model each regulatory module. The dynamic Bayesian network (DBN) [38, 87] was considered in this study, as illustrated in Figure 5(B). Let X t = x 1 t , , x n t T denote the gene expression vector of n genes at time t (t = 0, …, K), the joint distribution function can be given as follows [30]

f X 0 , , X K =f X 0 t = 1 K i = 1 n f x i t | parent x i t
(2)

where we assume that Xt + 1 only depends on Xt (the Markov chain property) and the form of dependence can be described by a linear model [38, 61, 87]. More specifically, we used the model

X t + 1 =A X t +E
(3)

where A is a matrix of constant regulatory coefficients, E is the error vector that follows a multivariate normal distribution N(0, Σ) with Σ=diag σ 1 2 , , σ n 2 . In matrix A, let ai,j denote the regulatory coefficient between the regulator x j and the target x i . According to this definition, ai,j = 0 if there exists no regulation between x j and x i in the background network, ai,j > 0 when a positive regulation exists, and ai,j < 0 when a negative regulation occurs.

To control the false positive rate in model selection for a high-dimensional linear regression problem as in Eq. (3), the LASSO method can be considered [64]. However, the existing LASSO formulations such as the adaptive LASSO [88] or the generalized LASSO [65] cannot directly accommodate the constraints posed on ai,j; for example, the regulatory coefficients associated with miRNAs are always negative. Therefore, we propose the so-called constrained LASSO, which is given as follows

min X ˜ A ˜ - B ˜ L 2 2 + λ A ˜ L 1 , s . t . G ˜ A ˜ 0 , S ˜ A ˜ 0 , E ˜ A ˜ = 0 ,
(4)

where A ˜ n 2 × 1 = a 1 , 1 , , a 1 , n , , a n , 1 , , a n , n T denotes the coefficient vector generated by stacking the columns of A, X ˜ n K × n 2 denotes the direct sum of the matrices X 0 , , X 0 n × n T ,…, X K - 1 , , X K - 1 n × n T , and B ˜ n K × 1 is the vector generated by stacking the expression vectors X1, …, XK. In addition, G ˜ n 2 × n 2 is a diagonal matrix with its (n(j - 1) + in(j - 1) + i) entry being 1 only if ai,j ≥ 0; otherwise, the diagonal entry is set to zero. S ˜ n 2 × n 2 and E ˜ n 2 × n 2 are constructed in a similar manner, representing negative and equality constraints, respectively. Also, experimentally-confirmed regulatory relationships correspond to constraints ai,j ≠ 0; for such constraints, we introduced two nonnegative variables such that a i , j = a i , j + - a i , j - and the constraints on a i , j + and a i , j - can be incorporated into G ˜ . In addition, λ ≥ 0 is the penalty coefficient and the optimal λ value can be determined using a cross-validation procedure. It has been shown by previous studies that the optimal λ value should lie within the interval [0, λmax], where λ max = 2 X ˜ T B ˜ L [89]. We thus search the equally-spaced grids in this range using the five-fold cross validation [64] for the optimal value, which corresponds to the minimum square prediction error [64, 89]. Finally, it should be mentioned that the optimization problem in Eq. (4) is solved using quadratic programming [64, 88].