Background

Gene expression in multicellular eukaryotes varies considerably between tissues and can change dramatically even within the same cell type. Regulation of transcription is one of the fundamental mechanisms for controlling the observed diversity in gene expression [1, 2], and recent studies have underscored the importance of combinatorial regulation by multiple transcription factors (TFs) in this regard [36]. Progress is also being made towards experimental methods for testing combinatorial regulators on a larger scale in near-physiological conditions [7]. Combinatorial regulation can explain, in general, how a relatively small number of TFs can govern gene expression under diverse conditions.

One such example is the regulation of gene expression in immune responses. Pathogen recognition in the vertebrate innate immune system is initially performed by a limited number of pattern-recognition receptors (PRRs). The Toll-like receptors (TLRs) are a family of PRRs responsible for the recognition of a wide variety of pathogen-associated ligands, such as lipopolysaccharide, viral RNA, unmethylated CpG DNA and so on. The recognition of ligands activates signaling pathways leading to the activation of several TFs, such as NF-κB, and IRFs. These TFs are known to induce expression of various genes and evoke pleiotropic immune responses. Although several studies have addressed the importance of combinatorial transcriptional regulation in TLR signaling [8, 9], little is known about which specific combinations of TFs are involved.

Various bioinformatics strategies have been used for the prediction of cooperation between TFs. Some studies have used a combination of features, such as co-expression data and protein-protein interactions [10, 11]. Other studies utilized a combination of chromatin immunoprecipitation combined with microarray (ChIP-chip) data and expression data [1216]. However, while the integration of heterogeneous experimental data sources is potentially very powerful, in practice such data is too scarce to be of use to a particular tissue of interest, especially for higher eukaryotes like humans and mice. ChIP-chip data, in particular, is available for a very limited number of TFs, in a limited number of cell types. In addition, in the case of de novo predicted regulatory motifs, it might not be known what protein (if any) is binding the motif in question, which restricts the applicability of ChIP-chip analysis. For these reasons, a number of studies have focused on identifying combinatorial regulation solely based on predicted transcription factor binding sites (TFBSs). For example, Murakami et al. used position weight matrices (PWMs) to predict TFBSs on a genomic scale in order to quantify the co-occurrence of regulatory motifs in human promoters [17]. Sudarsanam and colleagues used a cumulative hypergeometric distribution to predict regulatory motifs co-occurring on a genome-wide scale in yeast [18]. Other studies have described measures for co-occurrence of pairs of motifs as a measure to predict TF synergy [19]. Synthetic libraries of promoters have been used to study combinatorial regulation using thermodynamic models [20], and more recently, combinations of oligomers have been used to predict from sequence EP300-bound and CREBBP-bound enhancers in three mammalian cell types [21].

A small number of studies have attempted to identify pairs of co-occurring motifs in the promoters of co-expressed genes [22, 23]. However, methods for predicting combinatorial regulation from predicted TFBSs are plagued by a number of problems. These include similarities beween the PWMs used to predict TFBSs, biases caused by motif over-representation, and difficulty of evaluating the significance of observed co-occurrences using standard statistical tests.

In this study, first we describe a new measure for TFBS pair co-occurrence. For each PWM pair (A,B), we calculate the frequency of motif B in sequences containing one or more A sites, as well as the frequency of motif B in sequences that lack A sites. We use the ratio of these two frequencies, the frequency ratio (FR), as a measure for co-occurrence. Applying this measure on the TFBSs in the genomic set of human and mouse promoters, we observed how co-occurrence tendencies are strikingly different between promoters with high GC content and CpG scores and promoters with low GC content and CpG scores, with the latter having a higher variety in FR values. We also observed a strong influence of TFBS GC content differences.

Based on the above observations, we developed a method for predicting co-regularing pairs of TFs in a set of co-expressed genes. Given the promoter sequences for a set of genes that are co-expressed, we identify motif pairs that co-occur more often than expected. We use the relative increase in co-occurence in the co-expressed set of genes as an indicator of combinatorial regulation.

Our proposed method was designed to overcome the problems associated with previously reported statistics-based measures of co-regulation. In order to obtain a measure of statistical significance, we compare the observed FR values for pairs of motifs in a set of co-expressed genes with those of sets of genes sampled at random, thus taking into account biases caused by genome-wide co-occurrence tendencies. We applied our approach to a number of sets of co-expressed mouse genes, and found a number of significantly co-occurring PWMs pairs. Importantly, the proposed approach was not biased by TFBS motif over-representation, and could thus detect co-occurrences missed by existing approaches. For the identified TF pair NF-κB - C/EBPαwe experimentally validated the co-regulation after TLR stimulation in dendritic cells. Since the proposed method does not rely on ChIP-chip data, it is generally applicable and can complement existing computational methods for discovery of TF co-regulation.

Methods

We refer to Additional file 1 for a workflow of our framework for the detection of co-occurring motifs.

Promoter sequences

We used a combination of DBTSS data [24], CAGE data [25], and annotation data from the UCSC Genome Browser [26] to define transcription start site (TSS) positions for both human and mouse genes, as described before [27]. The regions from -1000 to +200 were extracted from the repeat-masked hg18 and mm9 versions of the human and mouse genome. For each pair of highly similar sequences (BLAST E value < 1e-70, threshold decided after visual inspection of alignments) one sequence was removed from our sequence dataset in order to reduce biases caused by duplicated sequences.

Position weight matrix dataset

From the TRANSFAC [28] and JASPAR [29] databases all vertebrate PWMs were extracted. Redundancies were removed using tomtom [30] by the following strategy: for each pair of similar PWMs (tomtom E value < 1, and overlap between motifs > 75% of each motifs length) the motif with the lowest information content was removed from our dataset. Pairs were considered in order of increasing tomtom E value. This resulted in a PWM dataset of 199 non-redundant PWMs, each representing a group of similar PWMs. For each PWM a score threshold was set in a way that there is about 1 hit per 5000 bps in the mouse promoter sequences. GC content values of PWMs were calculated as the average of the probability of nucleotides C and G over all positions of the PWMs.

Measure for TFBS co-occurrence: frequency Ratio

As a measure of TFBS co-occurrence we introduce the Frequency Ratio (FR) value. Consider two TFs, TF A and TF B, whose binding preferences are represented by PWM A and PWM B respectively. Given a set of sequences and the predicted sites for both PWMs, we calculate the FR(B|A), the tendency of sites for TF B to co-occur with those of TF A, as follows. First, we define seq(A) as the number of sequences containing at least one site for motif A, and n(B|A) as the number of sites for motif B co-occurring with one or more sites for motif A. From these we calculate frequency(B|A), a measure for the number of B sites co-occurring with A sites:

f r e q u e n c y B | A = n B | A / s e q A
(1)

Likewise, we define frequency(B|!A) as the number of predicted sites for motif B per sequence lacking sites for motif A:

f r e q u e n c y B | ! A = n B | ! A / s e q ! A
(2)

where n(B|!A) is the number of B sites in the set of promoters lacking A sites and seq(!A) is the number of sequences without A sites. We calculate the ratio of these two frequency values, FR(B|A):

F R ( B | A ) = n ( B | A ) s e q ( A ) n ( B | ! A ) s e q ( ! A ) = n u m b e r o f B s i t e s p e r s e q u e n c e h a v i n g 1 A s i t e n u m b e r o f B s i t e s p e r s e q u e n c e h a v i n g n o A s i t e
(3)

FR(B|A) is a measure for the tendency of sites of motif B to be present in sequences having at least one A site compared to sequences not having an A site. In order to limit the bias caused by overlapping sites for pairs of similar motifs, sites for motif B overlapping A sites are not included in n(B|A). In the case of homotypic motif pairs (where motif B and motif A are the same motif), there are obviously no motif B sites in sequences not containing sites for motif A. In this case we define frequency(A|!A) = 1. In this case FR(A|A) can be interpreted as the average number of A sites in sequences containing at least one A site. Note that the FR measure is not limited to TFBS motifs, but can be used for other sequence motifs and nucleotide oligomers.

Micro-array gene expression data

We used micro-array expression data for a large number of human and mouse tissues [31], and for dendritic cells (DCs) after stimulation with a number of immune stimuli [9] (GSE17721). The raw intensity data were processed to calculate robust multi-array average (RMA) values. Genes with at least 3-fold differential expression between any pair of samples were picked up. Expression values for each gene were rescaled to mean 0 and standard deviation 1, and dimension reduction was performed with principal component analysis. The gene expression patterns were hierarchically clustered with Ward's algorithm and divided into subclusters. All analyses were performed using R and Bioconductor.

Definition of CpGhigh promoters and CpGlow promoters

We based our definition of CpGhigh and CpGlow promoters on previously reported definitions for CpG islands [32]. For each promoter sequence we defined the ratio of each nucleotide as the number occurrences of that nucleotide divided by the number of non-masked nucleotides in the sequence. The GC content is then the sum of the C ratio and the G ratio. The CpG score is the observed ratio of CpG dinucleotides divided by the ratio expected from the ratio of C and G nucleotides. We defined "CpGhigh promoters" as promoters with a GC content ≥ 50% and a CpG score ≥ 0.6. Promoters not meeting these requirements were labeled as "CpGlow promoters". These two conditions gave us 6750 CpGhigh promoters in mouse (37% of total) and 9029 CpGhigh promoters in human (50% of total).

Detection of over-represented TFBSs

As a measure for over-representation of a TFBS motif in a set of sequences we use the Over-Representation Index (ORI) as defined by Bajic et al. [33]. P-values for ORI values were assigned by random sampling sets of sequences of the same size as the set of interest from the genomic set of promoter sequences. For the estimation of p-value in CpGhigh promoters and CpGlow promoters the sampling was done from the genomic set of CpGhigh promoters and the genomic set of CpGlow promoters, respectively. As p-value threshold for over-representation we used the threshold of 0.01.

Co-occurrence significance in promoters of co-regulated genes

For each co-expressed set of genes containing at least 50 genes, we identified over-represented TFBSs, as described above. For each pair of motifs (A,B), where A is an over-represented motif and B is any of the 199 PWMs, we calculated FR(B|A) set in the promoter sequences corresponding to the set of co-expressed genes. The significance of FR(B|A) set was evaluated using a sampling approach. In this sampling approach, a set of sequences are sampled from the genomic set of sequences until the same number of B sites and A sites as in the co-expressed cluster is obtained. From these, a FR(B|A) sampled is calculated, reflecting the genome wide co-occurrence tendencies of the pair of motifs. This is repeated a large number of times (105 times in this study), and the p-value of FR(B|A) set is defined as the ratio of the number of times where FR(B|A) sampled FR(B|A) set .

Generation of artificial and semi-artificial promoter sequences

Artificial promoter sequences were constructed by generating sequences of the same length as the actual promoter sequences used in this study, where at each position the nucleotide is decided using a uniform distribution over the alphabet (A,C,G,T). Semi-artificial sequences were generated by scanning through actual sequences and randomly adding either a G or C to the semi-artificial sequence when a G or C was encountered in the actual sequence; and randomly adding either an A or T when an A or T was encountered.

Construction of plasmids, transfection, and luciferase assay

Promoter sequences of selected genes were PCR amplified and cloned into pGL3-basic vectors (Promega). Sequences from about -1200 to +250 relative to transcription start site were cloned. kB-tandem reporters were purchased from Promega. Complementary DNA for TFs was PCR amplified and cloned into pEF-BOS expression vectors. The resulting reporter plasmids and TF over-expression plasmids were co-transfected into HEK 293 cells with pRL-TK encoding Renilla luciferase (Promega) and appropriate signaling molecules with using Lipofectamin 2000 (Invitrogen). At 24 hours after transfection, the cells were lysed and subjected to reporter assay according to the manufacture's instruction (Promega). The primers used will be provided upon request.

Results and discussion

Frequency ratio, a novel measure for co-occurrence of two TFBSs: general results and genomic tendencies

As a measure for the co-occurrence of the TFBSs for two TFs, TF A and TF B, we propose the Frequency Ratio, FR (see Methods section). The FR(B|A) value is a measure for the tendency of motif B to co-occur with motif A. On a molecular level, it reflects the tendency of TF B to bind the same promoters as TF A, although this does not necessarily imply a direct physical interaction between A and B. Cases where FR(B|A) values are higher than 1 reflect a tendency of TF B to bind promoter sequences that also are bound by TF A, while FR(B|A) values lower than 1 reflect a tendency for TF B to bind to promoter sequences not bound by TF A. To avoid biases caused by motif similarities, sites where motifs A and B overlap were discarded before the calculation of the frequency values. Note that FR(B|A) is not necessarily the same as, or similar to, FR(A|B) (Supporting text in Additional file 2).

Using the above definition of FR we calculated the genome-wide FR values for all 39,601 (199 × 199) TFBS motif pairs, in the genome-wide sets of 18,218 human promoter sequences, and 18,168 mouse promoter sequences. A histogram of FR values in the genomic set of mouse promoters is shown in Fig. 1A. Although the majority of PWM pairs have FR values close to 1 (84.9% of the pairs have a FR value between 0.7 and 1.3), some pairs have high or low FR values. Similar observations were made for human sequences (Fig. S2A in Additional file 3). The outliers with large or small FR values indicate the genome-wide tendencies for high or low co-occurrence of sequence motifs, respectively. These genome-wide tendencies represent reference values to which we will compare the FR values of particular sets of co-expressed genes.

Figure 1
figure 1

Genome-wide tendencies of Frequency Ratios. (A) Histogram of FR values for all PWM pairs in the genomic set of mouse promoter sequences. (B,C,D) Plots of GC content differences as a measure of PWM-to-PWM dissimilarity (Y-axis) versus FR values (X-axis, same as in A), for all promoters (B), CpGhigh promoters (C), and CpGlow promoters (D).

Similar sequence motif pairs tend to be co-occurring

Next, we analyzed the correlation between FR values and motif-motif similarity. We used the difference of GC content between pairs of motifs as an indicator of motif similarity. Fig. 1B shows a plot of the difference in GC content between pairs of motifs versus FR for the genomic set of mouse promoter sequences. This figure clearly shows that motif pairs with a smaller difference in GC content tend to have higher FR values, while motif pairs with different GC content tend to have lower FR values. A similar tendency was obtained in human promoter sequences (Fig. S2B in Additional file 3). Since we excluded overlapping sites, the tendency of these motifs to co-occur cannot be explained simply by a tendency of sites for similar PWMs to overlap with each other. In semi-artificial promoter sequences, where overall GC content and local GC content fluctuations were identical to those of real promoter sequences, a similar tendency was observed (84.7% of the pairs had a FR value between 0.7 and 1.3; Fig. S3A in Additional file 4). On the other hand, in completely artificial sequences with 50% GC content, this tendency was not observed: the vast majority of motif pairs had FR values close to 1 (99.0% of the pairs had a FR value between 0.7 and 1.3; Fig. S3B in Additional file 4).

CpGlow promoters have a higher variety of FR values than CpGhigh promoters

Given the observed influence of GC content on co-occurrence, we decided to separately investigate the tendencies in CpGhigh and CpGlow promoters. The genome-wide set of promoter sequences was divided into a set of CpGhigh promoters and CpGlow promoters (see Methods section), and the FR values in each set were calculated. Figure 1C shows the difference in GC content between pairs of motifs versus their FR values in the genome-wide CpGhigh promoter set in mouse. From this figure we can see that for the CpGhigh promoter set, the tendency for motif pairs with a smaller (larger) difference in GC content to have higher (lower) FR values was not observed. In contrast, for the CpGlow promoters (Fig. 1D) such a tendency was clearly observed. These trends were also found in human sequences (Fig. S2C,D in Additional file 3) and semi-artificial promoters sequences (Fig. S3C,D in Additional file 4). Assuming that the variety in FR values reflects the potential of sequences to encode combinatorial regulation, these results suggest that the regulatory complexity of CpGlow promoters is higher than that of CpGhigh promoters.

An additional result supporting the notion that CpGlow promoters have a higher potential for combinatorial regulation was obtained from the analysis of FR values of 200 randomly selected oligomers. Although these oligomers are likely not to be associated with any regulatory motifs, the tendencies of FR values are similar to those we observed for PWM motifs in CpGhigh and CpGlow sequences (Fig. S4 in Additional file 5).

Only few motif pairs have high or low FR values on a genome-wide level

The above observations raise the question to what extent genome-wide FR values are indicative of combinatorial regulation between pairs of TFs on a genome-wide level. We compared FR values observed in the genome-wide set of promoter sequences with those observed in 10 sets of semi-artificial sequences. To take into account the influence of GC content difference, we divided pairs of PWMs into 10 bins according to their pairwise GC content difference. Table 1 shows all pairs of PWMs with high and low tendencies to co-occur on a genome-wide scale (FR values higher/lower than 99.99% of the FR values observed for PWM pairs with similar GC content differences in semi-artificial sequences).

Table 1 PWM pairs with high and low FR values in the genomic set of promoters.

Two important observations can be made. The first point is that out of the 39,601 pairs of PWMs, only very few have exceptionally high/low FR values (Table 1). In mouse promoters 14 pairs, and in human promoters 2 pairs of PWMs had exceptionally high values, while in mouse promoters 2 pairs had low values. In human and mouse CpGhigh sequences and mouse CpGlow sequences, no pairs with exceptional values were found. These indicate that the vast majority of FR values in true sequences are within the range of values we can expect to find in semi-artificial sequences lacking any biological meaning. A second point is that, interestingly, most of the pairs with high values involve the TBP motif (TATA box), a core promoter motif. This motif is thought to be typically present in non-CpG island-associated promoters of genes with relatively strictly regulated transcription initiation. On the other hand, we found the GC-rich and CpG-rich E2F TF motif, which might be indicative of CpG-rich sequences, to have low co-occurrence with a regulatory motif (TFE).

In conclusion, the above observations support the hypothesis that the genome-wide variation in FR values, except for those involving a small number of exceptional sequence motifs such as the TATA box and GC-box, is mainly a result of sequence variations, and not an indication of genome-wide combinatorial interactions between TFs. It is important that these genome-wide biases are taken into account by approaches that predict combinatorial regulation in smaller sets of co-regulated genes.

The FR approach allows for detection of co-occurring regulatory motifs in tissue-specific promoter sequences without bias caused by TFBS over-representation

Next, we turned our attention to the problem of finding significantly co-occurring motifs in the promoter sequences of sets of co-expressed genes. We used gene expression data for a large number of mouse tissues [31] to define clusters of co-expressed genes (Fig. S5 in Additional file 6, Table S1 in Additional file 7), and applied our approach to each cluster containing at least 50 genes (see Methods). Below, we present some of our findings for a set of 155 genes with high expression in mouse liver and kidney. Of these, only 13 genes were associated with a CpGhigh promoter. In the following discussion we focused on the CpGlow promoters. The most over-represented motifs in this set of promoters were those for PWMs for HNF1, HNF4, and a number of nuclear receptors (Table S2 in Additional file 8). These factors and their importance in liver-specific regulation of transcription have been widely reported [3437].

As described in Methods, the significance of co-occurrence for each A-B TF pair present in the set of sequences was estimated by sampling sequences from the genomic set of sequences until the same number of A and B sites were obtained, and the p-value of FR(B|A) set was computed. In the kidney/liver-specific set we found 11 co-occurrences with p-values < 0.01. In one case, the PWM for FOXP1 co-occurred significantly with HNF1 sites. FOXP1 plays a role in the development of various organs, including liver. In this set of promoters, 17 out of 21 non-overlapping FOXP1 sites co-occurred with HNF1 sites, yielding a FR value of 5.5 (Fig. 2A). Given the genomic FR value (1.6), we would expect only about 10.7 FOXP1 sites to co-occur with HNF1 sites on average (see Methods, from Eq. 3). Moreover, the distance between FOXP1 sites and HNF1 sites was biased towards proximal positioning: of the 26 FOXP1-HNF1 site pairs, 14 were separated by less than 200 bps (Fig. 2B,C). In addition, visual inspection of the site pairs revealed a preference of the FOXP1 site to be upstream of the HNF1 site.

Figure 2
figure 2

Predicted HNF1 and FOXP1 binding sites in the promoters of liver- and kidney-specific genes. (A) Visual representation of predicted TFBSs for HNF1 (green) and FOXP1 (red) in promoters containing at least 1 predicted site for each TF. Gene symbols are indicated at the left. (B) Histogram of distances between pairs of HNF1 and FOXP1 sites in the promoters of liver- and kidney-specific genes. (C) Histogram of distances between pairs of HNF1 and FOXP1 sites in the genomic set of promoters.

To illustrate the difference between our approach and approaches based on statistical tests, we calculated co-occurrence p-values using the method of Yu et al. [22], and using the method of Sudarsanam et al. [18]. The approach by Yu et al. evaluates co-occurrences using two p-values, one for co-occurrences, P occ , and one for the bias in distances between pairs of sites, P d . Here we focused on P occ , the probability of observing an equal or greater number of co-occurrences, calculated based on the number of sequences in the co-regulated set versus the size of the genome-wide set, the number of co-occurrences between two motifs in the genome-wide set, and the number of co-occurrences in the co-expressed set. The approach by Sudarsanam et al. uses a cumulative hypergeometric model to evaluate the significance of the observed number of co-occurrences for a motif pair, by comparing it to the distribution of expected co-occurrences given the number of occurrences of the individual motifs. We applied our FR approach, the P occ approach, and the Sudarsanam approach on all sets of co-expressed genes, and compared the results in terms of the over-representation of co-occurring motifs. Fig. 3 shows that the distribution of ORI p-values for all 1294 PWMs co-occurring significantly with an over-represented motif is similar to that of all PWMs, confirming that the FR approach is not biased by motif over-representation. Indeed, the majority of predicted co-occurring motifs are not over-represented. In contrast, the distribution of ORI p-values of predicted co-occurring motifs in the top 1294 pairs as predicted by P occ , showed a strong bias towards lower ORI p-values, indicating that this method is strongly biased by motif over-representation. The fact that with increasing motif over-representation the expected number of co-occurrences modeled by the hypergeometric distribution also increases, makes the approach described by Sudarsanam et al. [18] relatively robust against the bias caused by motif over-representation, but less so than the FR measure. However, this method does not use a reference set of sequences during the evaluation of significance, making it the most easily affected of these three approaches by PWM-to-PWM similarities (as measured by GC content differences). A relatively high number of co-occurring pairs predicted by the approach by Sudarsanam et al., have similar GC content levels, and pairs of motifs with large differences in GC content are relatively rarely predicted to be co-occurring (Fig. S6 in Additional file 9).

Figure 3
figure 3

Over-representation p-values of co-occurring motifs predicted by two approaches. Co-occurrences based on P occ are strongly biased by PWM over-representation, while this is not the case for co-occurrences predicted using the FR measure. The distribution of ORI p-values as measure for PWM over-representation is shown 1) for all PWMs in all sets of co-regulated genes ("all motifs), 2) for the 1294 PWMs found to be significantly co-occurring with an over-represented motif according to FR values ("co-occurring motifs, FR"), 3) for the PWMs found to be co-occurring with an over-represented motif according to P occ ("co-occurring motifs, P occ "), and 4) for the PWMs found to be co-occurring with an over-represented motif according to the approach of Sudarsanam et al. ("co-occurring motifs, Sudarsanam"). For the latter two approaches the 1294 pairs with the most significant co-occurrence were used.

As an illustration, for the set of promoters of liver- and kidney-specific genes in mouse, the top co-occurrences in terms of P occ were strongly dominated by PWM pairs containing HNF1 and HNF4, which were both strongly over-represented in this cluster. In the top 20 motif pairs, 18 involved HNF1, which was found to have significant P occ values with most other over-represented motifs, such as those for HNF4 and Ikaros. The pair HNF1 - HNF4 had the lowest P occ value (2.06e-11). However FR( HNF4|HNF1) set was only moderately higher than FR( HNF4|HNF1) genomic (1.22 vs 1.01, p-value 0.25). Indeed, only 27 out of 62 (44%) HNF4 sites co-occurred with HNF1 sites, which were present in 60 out of 155 (39%) sequences in this cluster. Even though both motifs were over-represented in this cluster, they did not have a strong tendency to be present in the same sequences. The measure described by Sudarsanam et al. too, predicted a number of significant co-occurrences involving HNF1. Strikingly, the top 10 motifs predicted to co-occur with HNF1 motifs have similarly low GC content value as the HNF1 motif (mean difference in GC content: 5.7%), while this is not the case for the 10 motifs with most significant FR values with regard to HNF1 (mean difference in GC content: 20.9%). Collectively, these results indicate that P occ is more related to co-over-representation than to co-occurrence, and that the measure proposed by Sudarsanam et al. is relatively sensitive to GC content similarities. Our proposed approach, on the other hand, can find non-over-represented motifs, which are likely to be missed by traditional approaches, and is less influenced by PWM-to-PWM GC content similarity.

Identification of TFs co-occurring with NF-κB or IRF in gene sets having specific expression patterns on TLR stimulation

Several TFs such as NF-κB, IRF, and AP-1 are known to play roles in gene expression evoked by TLR signaling. However, little is known about involvement of other TFs and how these TFs orchestrate variety of gene expression patterns.

To elucidate the diverse gene expression patterns generated by combinations of various TFs, we next applied our method for the analysis on TLR signaling. We used a large scale microarray dataset on gene expression in bone marrow-derived DCs after stimulation with various TLR stimuli [9]. The gene expression values were calculated and hierarchically clustered into 18 clusters (Fig. 4A, and Fig. S7 in Additional file 10 for details).

Figure 4
figure 4

Gene expression patterns in TLR-stimulated dendritic cells and identification of a TF pair synergistically upregulating target promoters. (A) Hierarchical clustering of gene expression in DCs upon TLR-stimulation. A heatmap of gene expression is shown with clusters indicated by the colored banner besides the heatmap. (B) The Nos2 promoter was synergistically controlled by RelA or MyD88 with C/EBPα. Schematic representation of the Nos2 promoter structure is shown (red box, NF-κB sites; blue box, C/EBPα sites). The Nos2 promoter was cloned upstream of a luciferase encoding gene. The resulting plasmid, where luciferase is under the control of the Nos2 promoter, was transfected into HEK 293 cells with RelA or MyD88 and C/EBPα. After 24 hours, luciferase activity was measured as described in Methods. Error bars represent standard deviations of duplicate experiments. The data shown is a representative of three independent experiments with essentially identical results.

Frequency ratio analysis revealed that a wide variety of TFs had significant co-occurrence with each other (Table S3 in Additional file 11). Firstly, we looked into TFs co-occurring with the binding motifs of NF-κB or IRF, since ORI analysis showed that these two motifs are significantly over-represented in the clusters, suggesting their key roles in TLR signaling, consistent with former studies (Table S2 in Additional file 8). Several TFs were found to have significant co-occurrence with NF-κB or IRF (Table 2). NF-κB and IRF motifs were found to co-occur significantly with themselves (homotypic co-occurrences) in clusters 1, 9, and 13. Clusters 1 and 9 contained genes whose expression levels peaked at about 0.5 to 4 hours after stimulation, suggesting multiple NF-κB sites in promoters of those genes enable rapid induction (Fig. S7A in Additional file 10). Genes in cluster 13 were specific for poly (I:C) stimulation, and their expression peaked at around 6 hours after stimulation (Fig. S7B in Additional file 10). In addition to various IRF family TFs, the PWM M00063 represents Stat1, implying that these genes may be induced secondarily by type I interferon. Other well-known motifs found were c-Fos (MA0099), presumably representing the AP-1 motif. These collectively suggest that our method could identify known TF pairs involved in TLR signaling.

Table 2 Overview of the co-occurrences in TLR-stimulated DC gene expression patterns.

We found that in cluster 14, CCAAT enhancer binding protein alpha (C/EBPα) had significant co-occurrence with NF-κB (Table 2, p < 6.23e-3). C/EBP family transcription factors are reported to be involved in TLR signaling-induced gene expression such as cytokine gene expression [38]. Importantly, while P occ is relatively low for the pair NF-κB - C/EBPα, up to 30 out of 199 PWMs have a P occ < 0.01 for co-occurrence with the over-represented NF-κB motif in cluster 14 (data not shown), making P occ not useful for predicting co-occurrences in this case. A gene in the cluster, Nos2, has C/EBP motifs and NF-κB motifs in its promoter (Fig. 4B, upper scheme), thus the promoter activity is expected to be controlled by C/EBP and/or NF-κB. To check this, we cloned the Nos2 promoter, and its activity upon over-expression of TFs or signaling molecules MyD88 simultaneously with C/EBPα was checked by luciferase assay. The activity of the Nos2 promoter was up-regulated only by over-expression of C/EBPα, RelA (a major component of NF-κB [39]), or MyD88 (an adaptor protein of TLR signaling pathways [40]) (Fig. 4B), indicating these TFs positively regulate the Nos2 promoter. Moreover, when C/EBPα was over-expressed simultaneously with RelA or MyD88, luciferase activity increased compared to that on RelA or MyD88 over-expression alone. This result indicated that C/EBPα controls the expression of the Nos2 gene, and also suggested that it controls the expression of genes other than Nos2 in clusters 14.

Synergistic activation of TLR-regulated promoters by NF-κB and C/EBPα

The above results prompted us to check if a broader array of promoters is regulated by C/EBPα. We tested whether C/EBPα controls NF-κB-regulated promoters or not. Since four PWMs in our PWM set represent C/EBP TFs (M00249, M00622, M00201, and M00159), we picked up promoters having predicted TFBSs for NF-κB and one of the four motifs (Fig. 5). We also added as a positive control Arg2, which has a predicted NF-κB site and has been reported to be a target of C/EBPβ [41].

Figure 5
figure 5

C/EBPa regulates a set of promoters with NF-κB motif. (A) Promoters with the co-occurrence of NF-κB and C/EBPα sites were suppressed (Nfkbiz), or activated (others) by C/EBPα. (B) κB-tandem promoter was not activated but rather suppressed by C/EBPα expression. Promoter structures are shown as in Figure 4B. Error bars represent standard deviations of duplicate experiments. All the results shown are representative of three independent experiments with essentially identical results

Five promoters out of 6 tested were up-regulated synergistically by RelA and/or MyD88 and C/EBPα (Fig. 5A), whereas activation of one promoter was suppressed by C/EBPα over-expression (Nfkbiz). In contrast, the tandem-kB luciferase reporter was suppressed by C/EBPα over-expression (Fig. 5B), indicating a specific activation of promoters by C/EBPα. These results suggested that C/EBPα synergistically and specifically up-regulates the activity of a set of promoters regulated by NF-κB. Taken together, our method successfully identified a pair of transcription factors involved in the immune response.

Conclusions

In this study, we introduced a new measure for regulatory motif co-occurrence, and investigated genome-wide co-occurrence tendencies between pairs of regulatory motifs. Our initial results show that some motif pairs have a strong tendency to co-occur, while other pairs have a strong tendency to avoid co-occurrence. However, further investigation showed that these tendencies reflect GC content fluctuations in promoter sequences, rather than a genome-wide level of combinatorial regulation: semi-artificial sequences in which GC content fluctuations were identical to real sequences showed very similar trends. On the other hand, this trend was lost in completely artificial sequences. Also, we found that only few pairs of regulatory motifs had exceptionally high or low FR values in the genomic set of sequences as compared to semi-artificial sequences. Pairs that did show exceptionally high FR values often involved the TATA-box motif, which might reflect the tendency of strictly regulated non-CpG island-associated promoters to contain a TATA-box more frequently than other promoter sequences.

In addition, we showed that the tendencies were completely different between CpGhigh promoters and CpGlow promoters. Our measure for regulatory motif co-occurrences showed a relatively limited range in CpGhigh promoters compared to CpGlow promoters. Similar observations were made for a set of randomly selected oligonucleotide motifs. These observations reflect a fundamental difference between these two types of promoters. CpG islands have been reported to be associated with ubiquitously expressed genes and housekeeping genes, while genes not associated with CpG islands tend to be tissue-specific or condition-specific genes [42, 43]. On the promoter sequence level too, there are considerable differences: while the promoters of CpG island-associated genes tend to lack typical core promoter elements and tissue-specific TFBSs, the promoters of genes not associated with CpG islands tend to contain TATA boxes or other core promoter elements and TFBSs allowing their precise regulation of expression [44]. The apparent lack of high and low FR values in the genome-wide set of CpGhigh promoters might reflect a relatively low need for complex combinatorial regulation, compared to CpGlow promoters. On the other hand, for tissue- or condition-specific genes combinatorial regulation might be necessary to ensure spatio-temporal specificity, reflected in the larger range of FR values observed in CpGlow promoters.

Keeping the above observations in mind, in the proposed method, for the set of promoter sequences of interest, the significance of co-occurring pairs was estimated using a random sampling procedure. This approach thus takes into account the genomic tendency of motif pairs of similar structure to appear in the same promoter sequence. Furthermore, we considered CpGhigh promoters and CpGlow promoters as separate cases. A recent study on TFBS analysis has led to a similar recommendation [45]. Moreover, we excluded overlapping pairs of sites to avoid bias caused by similarity between motifs.

Applying our method to a large number of tissue-specific sets of mouse promoters, we could predict a large number of pairs of significantly co-occurring TFBS pairs. One example is the pair HNF1 - FOXP1, for which we found binding sites to be significantly co-occurring in the promoters of genes with specific expression in liver and kidney. Moreover, the TFBSs of this pair of TFs showed a tendency to be located proximally to each other, with the FOXP1 TFBSs located upstream of the HNF1 TFBSs. Importantly, our approach demonstrated improved robustness against biases caused by strongly over-represented motifs in comparison to a previously reported statistics-based method. Indeed, the majority of significant interactions we found involved motifs that were not over-represented (a considerable fraction was actually under-represented). This was also the case for C/EBPα binding sites in cluster 14 from the DC expression data. Such motifs would thus not be detected by standard over-representation analysis.

For one of the significantly co-occurring TF pairs involved in TLR signaling we could verify the predicted combinatorial regulation. We found that C/EBPα co-regulates a set of promoters with NF-κB. Co-regulation by NF-κB and C/EBP has previously been reported. Lcn2 and Arg2 were reported as targets of C/EBPβ [41], confirming that the Frequency Ratio could predict biologically meaningful TF pairs. Moreover, C/EBPα itself has recently been found to control a number of RelA-dependent inflammatory promoters, and NF-κB activation synergistically with PU.1 [46], further supporting our findings on the significance of the NF-κB-C/EBP pair. It would be interesting to check the involvement of other pairs identified as in Table 2 in TLR-induced gene expression patterns.

There is some room for possible improvements of our approach. First of all, epigenetic factors that might be responsible for tissue- or condition-specific expression should be taken into account. At present, chromatin remodeling data is still limited to a small number of cell types, which makes it difficult to incorporate in our approach. Undoubtedly, as the amount of available data increases, there will be a need to incorporate it, resulting in approaches combining both TFBS information and epigenetic information. Importantly, while we have limited our analysis to promoter sequences here, the role of distal enhancers in the regulation of transcription is generally accepted. As epigenetic data for various cell types increases, we will become able to apply our method not only to promoter regions but also to enhancers, and investigate potential differences in combinatorial regulation occurring in promoter and enhancers. Secondly, since our approach relies on TFBS prediction, which is still known to have a low specificity, further developments in the prediction of TFBSs, and additional genome-wide binding data is likely to improve our method as well as other TFBS prediction-based methods.

In conclusion, our FR approach circumvents biases which former methodology suffers from, and we could identify some meaningful co-occurring TFBS pairs, one of which was experimentally supported. We believe this approach can help us detect combinatorial interactions between TFs in the regulation of transcription, and we also believe that this sets a basis for future developments in computational identification of combinatorial gene regulation.

An online application of our method, which we call REgulatory MOtif COmbination Detector (REMOCOD), is available at our website [47].