Background

A critical challenge in molecular biology is understanding the biological mechanisms underlying precise spatiotemporal regulation of gene expression in mammals. Significant regulation is known to occur at the level of transcriptional initiation and elongation [1], through the combinatorial interactions of transcription factors (TFs) [2, 3] and histone modifications (HMs) [4, 5]. By binding to specific DNA motifs, activator or repressor TFs regulate the recruitment and behaviour of RNA polymerase II (RNAP-II). Direct interactions between TFs and the transcription pre-initiation complex require genomic proximity to the transcription start site (TSS) or higher-order chromatin looping [6], corresponding with TF-binding motifs in the promoter or enhancer/silencer regions respectively [2, 7]. Post-translational modifications of the amino-termini of nucleosomal histones are also known to regulate transcription [8, 9] by either modulating the local chromatin structure to control TF accessibility [4] or directly recruiting chromatin remodellers and other related enzymes [10]. Altered gene expression caused by abnormalities in TF or HM patterns has been directly associated with hundreds of human diseases [3], including leukaemia [11], prostate cancer [12] and various developmental, autoimmune, neurological, inflammatory and neoplastic disorders [13].

The complex relationship between TFs and HMs is still largely unexplored. Statistical models have recently been developed to integrate high-throughput omics data with the aim of understanding the regulatory logic that follows from these interactions (recently reviewed in [14]). These models demonstrated that TFs and HMs are accurate predictors of mRNA transcript abundance in several organisms and cell types. However, the utility of this data-driven framework is not the ability to predict gene expression, but rather the insights that can be gained from investigating the putative regulatory interactions captured by an accurate model. A recent study showed that models constructed from position weight matrix-predicted TF binding, when combined with a tissue-specific H3K4me3 prior, yield similar prediction accuracy to models constructed from actual chromatin immunoprecipitation sequencing (ChIP-seq) data [15]. Furthermore, a principal component analysis of these models was able to extract correctly the established regulatory roles (i.e., activator or repressor) of 20 TFs and HMs in mouse embryonic stem cells (mESCs) [14].

In addition to providing a powerful explorative framework, predictive modelling of gene expression has yielded an unexpected and previously unexplained result: that TFs and HMs are statistically redundant in explaining mRNA transcript abundance at a genome-wide level. Moreover, redundancy has been identified both within and between TFs and HMs in mESCs [16], to the extent that a single TF (E2f1) is almost as informative as a panel of 20 TFs and HMs with well-established regulatory roles [15]. Here, statistical redundancy equates to two variables providing equivalent information (e.g., due to being strongly correlated), and it is important to appreciate that this does not necessarily imply functional redundancy (i.e., removing either element does not affect gene expression). Assuming the existence of functional redundancy between TFs and HMs outwardly contradicts our understanding of transcriptional regulation, in which TFs and HMs play complementary yet distinct roles in RNAP-II recruitment and elongation.

In this study, we investigate the fundamental cause of the statistical redundancy within and between TFs and HMs. First, we validate the robustness of previous findings by constructing genome-wide predictive models for different mammalian cell types and modelling algorithms. We confirm that TFs and HMs are both predictive of gene expression (measured by mRNA transcript abundance) and statistically redundant at a genome-wide level. We then extend the analysis by constructing individual models for thousands of ontology-classified biological processes. By departing from previous genome-wide analyses, we identify significant variance in the distribution of statistical redundancy across these processes, which we attribute to regions of open nucleosome-sparse chromatin maintained by the activity of boundary proteins and enriched for housekeeping genes. Finally, we discuss several implications of our findings and how they contribute to the overall understanding of regulatory logic in mammalian systems.

Results and discussion

Transcription factors and histone modifications are predictive of mRNA transcript abundance

As TFs and HMs are known to play critical roles in regulating transcription, accurate predictive models of mRNA transcript abundance have been constructed from corresponding ChIP-seq binding data for various organisms, cell types and modelling techniques [14, 17–19]. To validate the robustness of these findings, we constructed both log-linear and support vector regression (SVR) models for two mammalian cell types: mESCs and human lymphoblastoids (GM12878).

Table 1 presents the prediction accuracy of log-linear and SVR models constructed from three sets of data: TF binding (TF), HM and DNase-I hypersensitivity (HM+DNase; both proxies for chromatin accessibility) and the concatenation of both (TF+HM+DNase). The proportion of transcript abundance variation explained by each model (adjusted R2) was calculated using a tenfold cross-validation [20], with the presented adjusted R2 values capturing the mean and standard deviation of these folds. The relationship between measured and predicted mRNA transcript abundance is visualised in Additional file 1: Figure S1 and Additional file 2: Figure S2 for mESCs and GM12878 cells, respectively.

Table 1 Prediction accuracy of predictive models of mRNA transcript abundance

It is evident that small sets of TFs, HMs and DNase are predictive of genome-wide mRNA transcript abundance in mESCs, as reported in previous studies [18]. We have further demonstrated that these results extend to different mammalian cell types and are robust against algorithm selection. As support vector regression (SVR) yields minimal improvement despite a two-order-of-magnitude increase in required CPU time, only log-linear regression is applied throughout the remainder of this study.

Transcription factors and histone modifications are less predictive of mRNA transcript abundance in differentiated cells

Our results reveal that models of transcriptional regulation in GM12878 cells are less accurate than those constructed for mESCs. To ensure that this does not simply reflect the different proportion of zero-expression genes in the GM12878 and mESC data, we constructed log-linear regression models considering only genes with RPKM/FPKM-normalised (reads/fragments per kilobase per million) transcript abundance > 0. These models yielded an average reduction in adjusted R2 prediction accuracy of 58% and 7% for GM12878 and mESC, respectively (not shown), ruling out a high proportion of zero-expression genes in the differentiated GM12878 cell line as the underlying cause of the observed performance gap. The removal of these genes also adversely affected subsequent analysis (not shown), as much of the information used to elucidate the silencing roles of some regulatory elements (e.g., H3K9me3 and H3K27me3) is lost.

One explanation for the performance gap between mESC and GM12878 models may be the selection of individual TFs, which vary between cell types. The 12 TFs selected for mESCs (see Table 2) are known to play important regulatory roles specific to embryonic stem cell biology; i.e., as self-renewal regulators and pluripotency reprogramming factors [21, 22]. Initial differentiation of embryonic stem cells involves silencing of these TFs and activation of developmental regulators [23, 24], necessitating the selection of alternate TFs for GM12878 modelling. Although the 11 GM12878 TFs chosen are known to play key roles (see Table 3) in regulating various cellular, metabolic and developmental processes [3], it is possible that they represent a smaller fraction of the key regulators than the 12 considered for mESCs (in which regulatory logic is better characterised).

Table 2 Mus musculus (embryonic stem cell) data

A more likely explanation for the performance gap between predictive models for embryonic stem versus differentiated cells is the increasing heterogeneity in regulatory mechanisms following differentiation. In embryonic stem cells, the majority of gene promoters containing CpG (dinucleotide) islands are characterised by H3K4me3-bearing nucleosomes. Many of these genes are maintained in a bivalent state with the inherently antagonistic H3K27me3 repressive mark; these genes are expressed only at low levels, but are poised for rapid transition to active or silenced states in response to differentiation signals or other extracellular stimuli [36]. Accordingly, the genome-wide correlation between H3K4me3 and H3K27me3 in mESCs is very high (Pearson’s r = 0.78 versus -0.20 in GM12878, not shown).

Lineage-specific gene expression programs exhibit far less regulatory homogeneity; e.g., many genes are silenced by H3K27me3/polycomb-mediated facultative heterochromatinisation, whereas others are silenced by H3K9me3/HP1-mediated constitutive heterochromatinisation and subsequent DNA methylation. Synergistic and conditional relationships become more widespread (e.g., H3K4me3 is positively associated with expression only in the absence of H3K27me3), limiting the effectiveness of regression models only able to capture additive and simple-multiplicative relationships.

Future predictive modelling studies of differentiated cells could integrate information for the H2A.Z histone variant (not assayed), which is critical for maintaining metastable equilibrium between antagonistic H3K4me3 and H3K27me3 [37, 38], and could therefore be used to classify genes subject to different regulatory logic.

Individual transcription factors are statistically redundant for predicting mRNA transcript abundance

Functional redundancy between individual TFs has previously been observed in Saccharomyces cerevisiae [5] and Drosophila melanogaster [19] and proposed as an important mechanism in eukaryotes [2]. To investigate the existence of similar redundancy within mammalian TFs and HMs, log-linear regression models of genome-wide mRNA transcript abundance were constructed for all combinations of n TFs and m HMs and DNase considered in this study. Figure 1(a,b) shows the adjusted R2 distributions for these 4,095 mESC TF models and 255 mESC HM+DNase models, respectively, with the minimum and maximum prediction accuracies for each n and m connected by the blue and red curves. The corresponding results for GM12878 models are presented in Figure 1(c,d).
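The model counts quoted above follow from enumerating every non-empty subset of the available features. The sketch below is only an illustration of that enumeration: the feature labels are placeholders, and the figure of eight HM+DNase features is inferred from 255 = 2^8 - 1 rather than stated explicitly in the text.

```python
from itertools import combinations

def nonempty_subsets(elements):
    """Yield every non-empty combination of the given regulatory elements."""
    for size in range(1, len(elements) + 1):
        yield from combinations(elements, size)

# Placeholder labels (hypothetical); the actual TFs and HMs are listed in the data tables.
mesc_tfs = [f"TF{i}" for i in range(1, 13)]       # 12 mESC TFs
mesc_hm_dnase = [f"HM{i}" for i in range(1, 9)]   # 8 HM+DNase features (inferred, see lead-in)

# One regression model would be fitted per subset; here we only count them.
n_tf_models = sum(1 for _ in nonempty_subsets(mesc_tfs))
n_hm_models = sum(1 for _ in nonempty_subsets(mesc_hm_dnase))
print(n_tf_models, n_hm_models)  # 4095 255
```

Grouping the fitted models by subset size would then give the per-n and per-m accuracy distributions shown in the figure.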

Figure 1

Statistical redundancy within TFs and HMs in predicting genome-wide mRNA transcript abundance. (a,b) mESCs and (c,d) GM12878 cells. Adjusted R2 distributions of the log-linear regression models for all combinations of n TFs (a,c) and m HMs and DNase (b,d). The minimum and maximum prediction accuracies for each n and m are connected by the blue and red curves, respectively. Although models constructed from more regulatory elements generally yielded improved prediction accuracy, the rapidly diminishing improvement when adding additional elements to the model suggests significant statistical redundancy within TFs and HMs. It is important to note that statistical redundancy does not necessarily imply functional redundancy. HM, histone modification; TF, transcription factor.

Although models constructed from greater numbers of regulatory elements generally yielded improved prediction accuracy, this improvement diminished considerably for values of n or m greater than 4. This is particularly evident for mESC TF models; despite all 12 TFs being known to play key roles in mESC biology [21], a model constructed from E2f1 data alone yielded almost the same mRNA transcript abundance prediction accuracy (adjusted R2 = 0.57) as a model integrating all 12 (adjusted R2 = 0.59). The most predictive TFs (E2f1, c-Myc, n-Myc and Zfx) were those known to localise preferentially to promoter regions (e.g., E2f1 was found to bind to more than 60% of mESC promoters), whereas the least predictive TFs (Smad1 and Stat3) bind further from the TSS (e.g., Smad1 was found to bind to less than 3% of mESC promoters).

Previous studies have demonstrated that many TFs (including E2f1 and Oct4) do not require the presence of a consensus motif to bind in vivo, but rather may be recruited to the promoter with the assistance of other bound TFs [39, 40]. These results may partially explain the observed statistical redundancy between mammalian TFs; i.e., if one TF is necessary for the recruitment of another, these TFs will provide similar predictive information regarding the regulatory state despite their distinct functional roles. It should be noted that this is distinct from functional redundancy; i.e., it does not imply the removal of any individual TF would not affect gene expression. It has previously been proposed that the parallel deployment of cooperatively bound TFs confers robustness to gene expression, both allowing the regulatory state of a gene to reject signalling noise and providing control through activation or inhibition of multiple signalling pathways [41].

Transcription factors and histone modifications provide equivalent information regarding genome-wide transcriptional regulation

In Table 1, we present the prediction accuracy of log-linear and SVR models constructed from three sets of data: TF binding (TF), HM and DNase-I hypersensitivity (HM+DNase) and the concatenation of both (TF+HM+DNase). In addition to the findings described earlier, it is apparent that the TF+HM+DNase models perform only marginally better than those constructed from TF or HM+DNase data alone, irrespective of algorithm or cell type. Although TFs and HMs independently provide significant information regarding transcriptional regulation, it appears that they provide the same information and are therefore statistically redundant. To quantify this phenomenon, we performed a partial correlation analysis by calculating the correlation between genome-wide transcript abundance prediction residuals for TF and HM+DNase models [42]. The residuals were found to be highly correlated (Pearson’s r > 0.8 for both mESCs and GM12878 cells, not shown), indicating a significant degree of association between the TF and HM+DNase data. These results (previously identified for mESCs [15, 16]) outwardly contradict our understanding of transcriptional regulation, in which TFs and HMs play complementary yet distinct roles in RNAP-II recruitment and elongation.
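The residual comparison described above can be sketched as follows. This is a minimal illustration on toy numbers, not the study's data; the function names are hypothetical, and the two prediction vectors stand in for the outputs of separately fitted TF and HM+DNase models.

```python
import math

def residuals(observed, predicted):
    """Per-gene prediction residuals (observed minus predicted)."""
    return [o - p for o, p in zip(observed, predicted)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy log-abundances and predictions from two hypothetical models.
observed = [1.0, 2.0, 3.0, 4.0]
tf_predicted = [1.5, 1.5, 3.5, 3.5]
hm_predicted = [2.0, 1.0, 4.0, 3.0]

# If both models miss the same genes in the same direction, r approaches 1.
r = pearson(residuals(observed, tf_predicted), residuals(observed, hm_predicted))
```

A high correlation between the two residual vectors indicates that both feature sets fail to explain the same portion of the variation, which is the sense of redundancy used here.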

Prediction of transcript abundance for genes grouped by biological process suggests a more heterogeneous role for transcription factors and histone modifications

To investigate the source of statistical redundancy between TFs and HMs, we constructed individual predictive models for thousands of ontology-classified biological processes. Insight can be gained into the nature of this redundancy by investigating its distribution across the smaller groups of genes contributing to each process.

The mESC and GM12878 genes were grouped by ontology-classified biological process. Two regression models were constructed for each set of genes: one considering only TF-binding data and the other considering only HM and DNase data. The ratio of adjusted R2 values for the TF and HM+DNase models was calculated to capture their relative performance.

Of the 1,880 mESC processes considered, 25 were found to exhibit a significant TF-to-HM+DNase adjusted R2 ratio (i.e., demonstrating that TF binding is more predictive of mESC mRNA transcript abundance than HMs and DNase for the markers considered in this study, Benjamini–Hochberg-corrected P < 0.05 [43]). A full list of these processes and their respective ratios is provided in Additional file 3: Table S1. Furthermore, 523 processes were found to exhibit a significant HM+DNase-to-TF adjusted R2 ratio (i.e., demonstrating that HMs and DNase are more predictive of mRNA transcript abundance than TF binding). These processes and their respective ratios are listed in Additional file 4: Table S2. The specific TF and HM+DNase adjusted R2 values used to calculate the ratios for each of the 1,880 processes are provided in Additional file 5: Table S3.

The distributions of adjusted R2 values for models constructed from mESC TF and HM+DNase data (visualised in Figure 2) demonstrate that their relative predictive power is heterogeneous across different biological processes. A similar distribution of adjusted R2 values is evident for GM12878 data (not shown), although statistically significant outliers were not confidently identified due to an overall lower prediction accuracy of GM12878 models (described earlier). It is important to note that this lack of outliers does not adversely affect subsequent analysis, which focuses on statistically significant trends across high- and low-scoring biological processes rather than the individual processes themselves.

Figure 2

Predictive power of TF binding and HM+DNase-based models. These models are of mRNA transcript abundance for 1,880 sets of mESC genes grouped by ontology-classified biological processes. Sets of genes exhibiting a significant HM+DNase-to-TF adjusted R2 ratio (i.e., for which HMs are more predictive of transcript abundance) are indicated in red, and those exhibiting a significant TF-to-HM+DNase adjusted R2 ratio (i.e., for which TF binding is more predictive) are indicated in blue. The overlap between the significant (Benjamini–Hochberg-corrected P < 0.05 [43]) and non-significant (grey) regions is due to the ratio significance threshold varying with the number of genes belonging to each group. HM, histone modification; TF, transcription factor; TFAS, transcription factor association strength.

Our results suggest that the genome-wide redundancy reported in previous studies [15, 16] is not indicative of functional redundancy, but rather arises from averaging over heterogeneous groups of genes subject to different regulatory logic. Moreover, the observation that HM and DNase data were significantly more accurate in predicting the mRNA transcript abundance of mESC genes contributing to 28% of biological processes suggests the existence of a small number of genes for which TF binding is considerably more informative. If under-represented in the processes examined, these genes would introduce negative bias into the null distribution (and therefore increase the number of statistically significant outliers) when randomly sampled during bootstrapped significance testing. This is also consistent with the order-of-magnitude fewer processes (1.3%) for which contributing mRNA transcript abundance was more accurately predicted by TF-binding data.

Redundancy in transcriptional regulation is dependent upon enrichment for housekeeping genes

The set of genes for which TF binding is considerably more predictive of gene expression than HM and DNase data was comparatively small (Figure 2). Inspecting the biological processes with high TF-to-HM+DNase and HM+DNase-to-TF adjusted R2 ratios for their constituent genes, it is apparent that the former are enriched for housekeeping tasks (e.g., ncRNA processing and RNA splicing) and the latter for tissue and context-specific processes (e.g., signal transduction and regulation of cell differentiation). To investigate whether these two groups of genes can be characterised accordingly, the top 100 processes from each list were tested for enrichment of housekeeping genes.

Figure 3 presents the housekeeping-gene enrichment of biological processes with the top 100 TF-to-HM+DNase and HM+DNase-to-TF adjusted R2 ratios for both (a) mESCs and (b) GM12878 cells. In both cases, the proportion of housekeeping genes contributing to biological processes is significantly larger for the top 100 TF-to-HM+DNase group (Welch’s t-test (a) P < 2.2 × 10⁻¹⁶ and (b) P < 2.6 × 10⁻⁶ [44]). These results suggest TF binding provides more information regarding the transcriptional regulatory state of mammalian biological processes enriched for housekeeping genes.

Figure 3

Proportion of housekeeping genes contributing toward key biological processes. These processes have the top 100 TF-to-HM+DNase (TF) and HM+DNase-to-TF (HM+DNase) adjusted R2 ratios for (a) mESCs and (b) GM12878 cells. The proportion of housekeeping genes is significantly larger for the TF group in both cases (Welch’s t-test (a) P < 2.2 × 10⁻¹⁶ and (b) P < 2.6 × 10⁻⁶). This suggests that TF binding provides more information regarding the transcriptional regulatory state of mammalian biological processes enriched for housekeeping genes and conversely that HMs and DNase provide more information for tissue and context-sensitive processes. HM, histone modification; TF, transcription factor; TFAS, transcription factor association strength.

Our findings can be explained by considering the spatial chromatin structure, as housekeeping genes are known to maintain constant ubiquitous expression by co-location in regions of actively transcribed open chromatin (e.g., at the boundaries of euchromatin and heterochromatin, topologically associating domains and larger A and B compartments [6, 45, 46]). These regions are maintained primarily by the activity of boundary proteins (e.g., CTCF [47]) rather than histone acetylation and methyl-recognising co-factors [48]. They are also significantly depleted for histone H1 (responsible for solenoidal chromatin packing [49]) and exhibit overall nucleosomal sparsity to provide unrestricted TF accessibility [50]. As HM ChIP-seq data were not normalised to nucleosome density (not assayed), this combination of nucleosome sparsity and TF-modulated chromatin structure presumably explains why HMs provide comparatively little information regarding the regulatory state of housekeeping genes. As TF–DNA complexes in open regions are capable of remaining stable throughout multiple rounds of transcription [51, 52], it is also unsurprising that a snapshot of local TF binding would provide more information regarding housekeeping mRNA transcript abundance than in the case of dynamically modulated chromatin.

Conclusions

Predictive modelling frameworks (recently reviewed in [14]) have the potential to fill an important gap between thermodynamically driven models of individual transcription regulatory events [53, 54] and association-driven network models of indirect gene regulation (e.g., those represented in the DREAM challenges [55]). Rather than modelling the regulation of specific genes, they can lead to more general conclusions regarding the roles and interactions of TFs, HMs and other key regulators of gene expression. Furthermore, they avoid the common issue of an underdetermined system by treating individual genes as observations of transcriptional regulatory logic in action, rather than variables in an association-driven analysis [56].

Recent predictive modelling studies have identified statistical redundancy between the regulatory roles of TFs and HMs [15, 16]. These findings outwardly contradict our understanding of transcriptional regulation, in which TFs and HMs play complementary yet distinct roles in RNAP-II recruitment and elongation. Moreover, there have previously been minimal attempts to resolve this contradiction (or even to distinguish between statistical and functional redundancy), potentially leading readers to perceive this modelling framework as one prone to capturing invalid biology. For the above reasons and to enhance our understanding of transcriptional regulatory logic, we believe that it is important to identify the underlying cause of this recurring statistical redundancy.

In this study, we validated the robustness of previous findings across multiple mammalian cell types and using different modelling algorithms. We extended this analysis by constructing individual models for thousands of ontology-classified biological processes, identifying significant variation in the relative predictive power of TFs and HMs across processes (i.e., the redundancy observed at the genome-wide level breaks down at this resolution). Importantly, this resolves the paradox between the distinct regulatory roles of TFs and HMs and the statistical redundancy within and between these elements.

Our investigation has highlighted several examples of simple predictive models yielding complex results that are consistent with our current understanding of fundamental molecular biology. With the hindsight provided by recent surveys of spatial genomic domains based on chromatin conformation capture [6], we can identify the signature of housekeeping genes localised to nucleosome-sparse domain boundaries by our inability to predict their expression using HM ChIP-seq data. Similarly, the statistical redundancy between TFs corresponds with the recently established notion of a transcription factor hierarchy, whereby the binding of a pioneer TF initiates a sequence of cooperative binding events that results in chromatin remodelling and/or RNAP-II recruitment [57]. The well-characterised crosstalk between TFs and HMs in regulating transcriptional initiation and elongation is also reflected in the statistical redundancy between TFs and HMs, and importantly our results have demonstrated that such correlation is unlikely to imply functionally redundant regulatory roles. These outcomes highlight the potential of predictive modelling as a powerful explorative framework for integrating heterogeneous genome-wide datasets to elucidate novel biology, and we encourage other researchers to incorporate such models in their own analysis pipelines.

Methods

Data availability and implementation

All Homo sapiens (GM12878 lymphoblastoid cell line) and Mus musculus (embryonic stem cell) data used in this study are detailed in Tables 2 and 3. All data and scripts are available online [58].

Table 3 Homo sapiens (GM12878 lymphoblastoid cell line) data

Calculation of transcription factor–gene association strength

For each gene i and TF j, ChIP-seq binding data for j was integrated to calculate a transcription factor association strength (TFAS), $a_{ij}$ [14, 15, 18]:

$$a_{ij} = \sum_k g_k \, e^{-d_k / d_0}, \qquad (1)$$

where $g_k$ is the height (mapped tags) of the $k$th TF-binding peak, $d_k$ is the distance (in base pairs) separating the $k$th peak from the TSS of gene i, and $d_0$ is the empirical decay rate derived from the approximate average widths of ChIP-seq peaks ($d_0$ = 5,000 for all TFs except E2f1 ($d_0$ = 500) [14]). Binding sites further than 30,000 bp from the TSS were not considered as their weighted contribution is negligible. The TFAS matrix A was log-transformed and quantile-normalised [63].
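As a rough illustration of Equation 1, the sketch below computes the TFAS of a single gene for a single TF. The input layout (a list of peak position and height pairs) and the function name are assumptions made for this example; the log transformation and quantile normalisation steps are not shown.

```python
import math

def tfas(peaks, tss, d0=5000.0, max_dist=30000):
    """Sum of peak heights weighted by exponential decay in distance from the TSS.

    peaks: iterable of (position_bp, height) pairs for one TF (hypothetical layout).
    d0: decay parameter in bp (5,000 for most TFs, 500 for E2f1 in the text).
    """
    total = 0.0
    for position, height in peaks:
        distance = abs(position - tss)
        if distance > max_dist:  # peaks further than 30,000 bp are ignored
            continue
        total += height * math.exp(-distance / d0)
    return total

# Toy example: one peak at the TSS, one 5 kbp away, one too distant to count.
peaks = [(10_000, 20.0), (15_000, 8.0), (60_000, 50.0)]
score = tfas(peaks, tss=10_000)  # 20 + 8 * exp(-1); the third peak is excluded
```

With d0 = 5,000 a peak at the 30,000 bp cut-off is already down-weighted by exp(-6), roughly 0.25% of its height, which is consistent with treating more distant peaks as negligible.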

An alternative formulation of the TFAS involves simply summing the number of mapped ChIP-seq tags either side of the TSS [16, 17] (e.g., from -4 to +4 kbp). The exponentially decaying formulation was chosen as it corresponds with the observed sharpness of ChIP-seq TF-binding peaks about the TSS and yields more accurate predictions of mRNA transcript abundance [16].

Calculation of histone and DNase scores

For each pair of gene i and HM j, the number of mapped ChIP-seq tags for j was summed within a region 2,000 bp either side of the TSS of i to calculate a histone score, $b_{ij}$ [14, 15]:

$$b_{ij} = \sum_k g_k, \qquad (2)$$

where $g_k$ is the number of ChIP/DNase-seq reads for j mapped to position k relative to the TSS of i. A region 2,000 bp either side of the TSS was chosen for consistency with previous studies [15–17, 19]. An equivalent method was applied to DNase-seq tags to produce a DNase score. The concatenated histone and DNase score matrix B was log-transformed and quantile-normalised as per the TFAS matrix A [63]. Unlike TFs, HMs do not exhibit sharp, well-defined ChIP-seq peaks about the TSS. This prevents the formulation of histone and DNase scores equivalent to Equation 1 [16].
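The window count of Equation 2 and a basic quantile normalisation can be sketched as below. Both functions are illustrative assumptions rather than the pipeline used in the study: reads are represented by single mapped positions, the normalisation is applied across the columns of a small matrix, and ties are broken by original order for simplicity. The log transformation is omitted.

```python
def histone_score(read_positions, tss, window=2000):
    """Count reads mapped within `window` bp either side of the TSS."""
    return sum(1 for pos in read_positions if abs(pos - tss) <= window)

def quantile_normalise(columns):
    """Give every column the same distribution: the mean of the sorted columns.

    columns: list of equal-length lists, one per feature (assumed layout).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices from smallest to largest
        normalised = [0.0] * n
        for rank, idx in enumerate(order):
            normalised[idx] = rank_means[rank]
        result.append(normalised)
    return result

# Toy example: three of the five reads fall within 2,000 bp of the TSS.
score = histone_score([9_000, 9_500, 11_999, 12_001, 20_000], tss=10_000)
normalised = quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalisation each column contains the same set of values, so features measured on different scales become directly comparable within the regression.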

Regression models for predicting mRNA transcript abundance

Predictive models of mRNA transcript abundance were constructed using two regression techniques: log-linear regression and SVR [64], as illustrated in Figure 4(a). Both techniques have been previously applied to modelling transcript abundance as a function of transcriptional regulatory elements and demonstrated to yield comparable predictive performance [15–19]. However, as log-linear regression and nonlinear SVR have previously been applied either to independent datasets or with different TFAS/histone score formulations, it remains unclear which is more appropriate for transcript abundance modelling.

Figure 4

Flowchart illustrating the experimental pipeline presented in this study. ChIP/DNase-seq data were used to construct regression models of mRNA transcript abundance for a set of genes. The prediction accuracy of each model was evaluated relative to RNA-sequencing data. By constructing groups of genes categorised by biological process and applying the above methodology, it was possible to identify heterogeneity in the relative predictive power of TFs and HMs. These groups were later analysed for enrichment for housekeeping genes.

Log-linear regression

The log-linear regression model describing mRNA transcript abundance as a function of TF binding, HMs and DNase was formulated as:

$$\log(y_i + \sigma) = \mu + \underbrace{\sum_j \beta_j a_{ij}}_{\text{TF binding}} + \underbrace{\sum_k \gamma_k b_{ik}}_{\text{HM+DNase}} + \varepsilon_i, \qquad (3)$$

where $y_i$ is the mRNA transcript abundance of gene i, $\mu$ is the basal transcript abundance and $\varepsilon_i$ is the gene-specific error term [14, 15, 18]. For a gene i and TF j, $a_{ij}$ is the TFAS defined in Equation 1 and $\beta_j$ is the fitted coefficient. Similarly, $b_{ik}$ is the histone (or DNase) score for a gene i and HM (or DNase) k defined in Equation 2 and $\gamma_k$ is the fitted coefficient. The constant $\sigma$ is fitted from a 20% held-aside dataset to avoid evaluation of log(0). A model considering only the TFAS or HM+DNase score data can be constructed by excluding the HM+DNase or TF-binding components from Equation 3, respectively. The linear regression implementation from the stats R package was used.
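The study fits Equation 3 with the linear regression implementation in R. Purely as an illustration of the same least-squares fit, the sketch below solves the normal equations in plain Python. The function names are hypothetical, sigma is fixed at 1.0 instead of being fitted from held-aside data, and no safeguard against collinear features is included.

```python
import math

def solve(a, b):
    """Solve the square system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_log_linear(features, abundance, sigma=1.0):
    """Least-squares fit of log(y + sigma) = mu + sum_j coef_j * x_j.

    Returns [mu, coef_1, ..., coef_p]; sigma=1.0 is an assumption for this sketch.
    """
    design = [[1.0] + list(row) for row in features]  # leading 1 fits mu
    target = [math.log(y + sigma) for y in abundance]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic genes generated from known coefficients (mu=1.0, beta=0.8, gamma=-0.3).
features = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]  # (TFAS, histone score) per gene
abundance = [math.exp(1.0 + 0.8 * a - 0.3 * b) - 1.0 for a, b in features]
mu, beta, gamma = fit_log_linear(features, abundance)
```

Dropping a column from `features` gives the TF-only or HM+DNase-only variants described above.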

Support vector regression

The SVR model describing mRNA transcript abundance of a gene i, $y_i$, as a nonlinear function of a TFAS and/or HM+DNase score matrix, X, can be formulated as:

$$y_i = \mu + \sum_i \alpha_i K(X_i, X) + \varepsilon_i, \qquad (4)$$

where $K(\cdot)$ is the kernel function and $\alpha$ represents the difference between the Lagrange multipliers fitted using a constrained quadratic optimisation process (described in Additional file 6). The ε-SVR with radial basis kernel function implementation from the e1071 R package was used with default parameters.
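To make the role of the kernel in Equation 4 concrete, the sketch below evaluates the prediction of an already-fitted model with a radial basis kernel. It does not perform the quadratic optimisation that fits the multipliers. The function names, the `gamma` width parameter and the toy values are assumptions for illustration, and the sum is written over the training genes so that it stays distinct from the gene being predicted.

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial basis kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svr_predict(x, support_vectors, alphas, mu, gamma):
    """Prediction mu + sum_t alpha_t * K(x_t, x) over the training genes t.

    alphas are assumed to be the already-fitted differences of Lagrange multipliers.
    """
    return mu + sum(a * rbf_kernel(sv, x, gamma) for sv, a in zip(support_vectors, alphas))

# Toy model with two training genes described by two features each.
support_vectors = [[0.0, 0.0], [1.0, 0.0]]
alphas = [2.0, -1.0]
prediction = svr_predict([0.0, 0.0], support_vectors, alphas, mu=0.5, gamma=1.0)
```

Because the kernel decays with distance in feature space, each prediction is dominated by training genes with similar TFAS and histone score profiles, which is how the model captures nonlinear relationships.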

Identifying heterogeneity in predictive power

Previous studies have focused on modelling mRNA transcript abundance genome-wide [15–19]. To investigate whether the relative performance of TF-binding and HM+DNase-based models varies across smaller sets of process-related genes, the Gene Ontology biological process annotations for the available set of 17,517 mESC and 38,041 GM12878 genes were considered [34]. For each process, a set containing all genes annotated with that term or any descendant term was constructed. Sets containing fewer than 50 or greater than 10,000 genes were discarded, yielding 1,880 and 1,965 sets of genes (for mESC and GM12878 respectively) for analysis.
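The construction of gene sets from a term and all of its descendants can be sketched as follows. The dictionaries used here (`children` mapping a term to its child terms, `annotations` mapping a term to its directly annotated genes) are a hypothetical in-memory representation, not the format of the Gene Ontology files.

```python
def descendants(term, children):
    """Return `term` together with every term reachable through the child relation."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def gene_sets(children, annotations, min_size=50, max_size=10_000):
    """Collect genes annotated with each term or any descendant, then filter by size."""
    terms = set(children) | {c for kids in children.values() for c in kids} | set(annotations)
    sets = {}
    for term in terms:
        genes = set()
        for t in descendants(term, children):
            genes |= annotations.get(t, set())
        if min_size <= len(genes) <= max_size:
            sets[term] = genes
    return sets

# Toy ontology with tiny size limits so that the filter is visible.
children = {"root": ["a", "b"], "a": ["c"]}
annotations = {"a": {"g1"}, "b": {"g2", "g3"}, "c": {"g4"}}
toy_sets = gene_sets(children, annotations, min_size=2, max_size=3)
```

In this toy case the root term is discarded as too large and the leaf term as too small, mirroring the 50 to 10,000 gene limits applied in the study.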

Two regression models were constructed for each set of genes: one considering only TF-binding data and the other considering only HM and DNase data. The performance of the fitted models was evaluated as an adjusted R2 score, which captures the proportion of variation in measured mRNA transcript abundance for those genes explained by the model. Unlike the R2 score (i.e., the coefficient of determination, equivalent to the square of the Pearson correlation coefficient and previously used to evaluate models of mRNA transcript abundance [16, 17]), the adjusted R2 prevents spurious inflation due to the introduction of additional explanatory variables [65]. The ratio of adjusted R2 values for the TF and HM+DNase models was calculated to capture their relative performance.
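For reference, the adjusted R2 used here can be computed from the ordinary R2 with the standard correction for the number of explanatory variables. The sketch below uses the residual-based definition of R2, which coincides with the squared Pearson correlation for a least-squares fit that includes an intercept; the function names and toy values are illustrative only.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(observed, predicted, n_predictors):
    """Penalise R2 for the number of explanatory variables in the model."""
    n = len(observed)
    r2 = r_squared(observed, predicted)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Toy example: identical predictions scored as if from models of different size.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
adj_one = adjusted_r_squared(observed, predicted, n_predictors=1)
adj_two = adjusted_r_squared(observed, predicted, n_predictors=2)
ratio = adj_one / adj_two  # analogous to a ratio between two models' scores
```

The same predictions score lower when attributed to a model with more variables, which is what prevents larger feature sets from being favoured automatically.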

A bootstrapped non-parametric test was conducted to identify sets of genes exhibiting a significant adjusted R2 ratio, as illustrated in Figure 4(b). Specifically, for each biological process annotating n genes, 5,000 sets of n genes were randomly sampled from the available 17,517/38,041 (mESC/GM12878) to generate a corresponding distribution of adjusted R2 ratios under the null hypothesis. From this distribution, a non-parametric P value was calculated using an empirical cumulative distribution function approximation [66]. Statistically significant P values were identified by applying a Benjamini–Hochberg correction with a false discovery rate of 0.05 [43].
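The resampling test and the multiple-testing correction can be outlined as below. This is a sketch under stated assumptions: `ratio_fn` is a hypothetical callable that fits the two models on a set of genes and returns the adjusted R2 ratio, the P value is one-sided (upper tail) with add-one smoothing, and the empirical estimate is simpler than the cumulative distribution function approximation cited in the text.

```python
import random

def null_distribution(all_genes, set_size, ratio_fn, n_samples=5000, seed=0):
    """Ratios for randomly sampled gene sets of the same size as the tested process."""
    rng = random.Random(seed)
    return [ratio_fn(rng.sample(all_genes, set_size)) for _ in range(n_samples)]

def empirical_p_value(observed_ratio, null_ratios):
    """Share of null ratios at least as large as the observed one (add-one smoothed)."""
    exceed = sum(1 for r in null_ratios if r >= observed_ratio)
    return (exceed + 1) / (len(null_ratios) + 1)

def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking the hypotheses rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # step-up: keep the largest rank that passes
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Toy P values: the third passes its threshold, so the first three are all rejected.
flags = benjamini_hochberg([0.001, 0.03, 0.035, 0.5])
```

Testing the opposite direction (HM+DNase-to-TF) would use the inverse ratio with the same procedure.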

Identifying enrichment of housekeeping genes

If the gene-specific residual, $\hat{\varepsilon}_i$, for a gene, i, is sufficiently large, it follows that the relationship between mRNA transcript abundance and the transcriptional regulatory elements described by the corresponding regression model does not hold for i. These poorly fitted genes were removed to identify enrichment of housekeeping annotation amongst sets of genes sharing common regulatory profiles. For each of the biological processes subsequently found to exhibit a statistically significant adjusted R2 ratio for the TF and HM+DNase models, genes exhibiting a studentised residual magnitude $|\hat{\varepsilon}_i| > 1$ were therefore discarded [67]. The remaining genes were tested for significant enrichment of housekeeping annotation using the bootstrapped non-parametric test methodology described above.