Probing transcription factor combinatorics in different promoter classes and in enhancers
In eukaryotic cells, transcription factors (TFs) are thought to act in a combinatorial way, by competing and collaborating to regulate common target genes. However, several questions remain regarding the conservation of these combinations among different gene classes, regulatory regions and cell types.
We propose a new approach named TFcoop to infer the TF combinations involved in the binding of a target TF in a particular cell type. TFcoop aims to predict the binding sites of the target TF upon the nucleotide content of the sequences and of the binding affinity of all identified cooperating TFs. The set of cooperating TFs and model parameters are learned from ChIP-seq data of the target TF. We used TFcoop to investigate the TF combinations involved in the binding of 106 TFs on 41 cell types and in four regulatory regions: promoters of mRNAs, lncRNAs and pri-miRNAs, and enhancers. We first assess that TFcoop is accurate and outperforms simple PWM methods for predicting TF binding sites. Next, analysis of the learned models sheds light on important properties of TF combinations in different promoter classes and in enhancers. First, we show that combinations governing TF binding on enhancers are more cell-type specific than that governing binding in promoters. Second, for a given TF and cell type, we observe that TF combinations are different between promoters and enhancers, but similar for promoters of mRNAs, lncRNAs and pri-miRNAs. Analysis of the TFs cooperating with the different targets show over-representation of pioneer TFs and a clear preference for TFs with binding motif composition similar to that of the target. Lastly, our models accurately distinguish promoters associated with specific biological processes.
TFcoop appears as an accurate approach for studying TF combinations. Its use on ENCODE and FANTOM data allowed us to discover important properties of human TF combinations in different promoter classes and in enhancers. The R code for learning a TFcoop model and for reproducing the main experiments described in the paper is available in an R Markdown file at address https://gite.lirmm.fr/brehelin/TFcoop.
KeywordsRegulatory genomics Computational biology Transcription factors Promoters Enhancers mRNA lncRNA miRNA
Transcription factors (TFs) are regulatory proteins that bind DNA to activate or repress target gene transcription. TFs play a central role in controlling biological processes, and are often mis-regulated in diseases . Technological developments over the last decade have allowed the characterization of binding preferences for many transcription factors both in vitro [2, 3] and in vivo . The current view is that TF combinations underlie the specificity of eukaryotic gene expression regulation , with several TFs competing and collaborating to regulate common target genes. As reviewed in Morgunova et al.  and Reiter et al. , multiple mechanisms can lead to TF cooperation. In its simplest form, cooperation involves direct TF-TF interactions before any DNA binding. But cooperation can also be mediated through DNA, either with DNA providing additional stability to a TF-TF interaction , or even without any direct protein-protein interaction. Different mechanisms are possible for the later. For example, the binding of one TF may alter the DNA shape in a way that increases the binding affinity of another TF . Another system is the pioneer/settler hierarchy described in Sherwood et al. , with settler TFs binding DNA only if adequate pioneer TFs have already bound to open the chromatin. Lastly, other authors have hypothesized a non-hierarchical cooperative system, with multiple concomitant TF bindings mediated by nucleosomes . This is related to the “billboard” system proposed for enhancers . On the other hand, TFs that belong to the same protein family usually share identical or similar motifs and may compete for sites that match both motifs .
Several papers have studied the combinatorics of TFs from a statistical point of view. Most works aim to identify co-occurring TF pairs, i.e. pair of TFs showing binding sites that are in closest proximity than one would expect by chance. These analyses have been done either on the basis of TF binding sites (TFBSs) predicted in silico [13, 14] or with TFBSs obtained from ChIP-seq experiments [15, 16]. Depending on the approach, different difficulties may arise. In silico predicted TFBSs are known to include large amount of false positives (see below), which may bias the analyses and impede the discovery of co-occurring TFBSs. On the other hand, studies based on ChIP-seq data require as many ChIP-seq data as the number of studied TFs, and hence are intrinsically limited by the availability of these data. Moreover, with hundreds (or even thousands) of sequences, a small co-occurrence tendency may be statistically significant, even if the effect is actually very weak and would not be biologically relevant. A few works have studied TF combinations in a more global way, above the TF pair level. For example, Teng et al.  have applied the “frequent itemset” methodology to identify sets of co-occurring TFBSs on the basis of ChIP-seq data. However, many questions remain on the molecular determinants orchestrating TF binding and combinations . Notably, with the expanding coding capacity of the human genome [19, 20], it remains to determine whether the expression of all gene classes, in particular coding mRNAs, long non-coding(lnc)RNAs and micro(mi)RNAs, is controlled by similar TF combinations in a given cell type. Likewise, TFs control gene expression through the binding of promoters and enhancers, which harbor similar but also specific genomic features . It is then not clear whether the binding preferences of a given TF are similar in enhancers and promoters.
Here, we analyze global TF combinations from a different perspective. Rather than identifying TF pairs/sets that co-occur more frequently than expected by chance, we aim to identify TF combinations that can be predictive of the binding of a target TF. More formally, given a class of regulatory sequences (for example 500 bp around the TSSs of the coding genes) and a ChIP-seq experiment targeting a specific TF in a specific cell type, we aim to identify the combinations of TFs whose predicted TFBSs can be used for predicting which sequences are effectively bound by the target TF in this cell type. Hence, rather than using purely statistical co-occurrence analysis, we study TF combinations in the framework of a TFBS prediction problem. The approach has several advantages. First, a single ChIP-seq experiment is theoretically sufficient to identify all TFs cooperating/competing with the target TF in the target cell type. Next, if a TF is selected in the combination, this means that its predicted binding sites are indicative of the presence of the target TF, which limit the number of false positives and the problems of spurious statistical significances. Finally, the approach takes into account all TFs and can therefore identify all possible TF combinations not just TF pairs.
TFBSs are traditionally modeled with position weight matrices (PWMs) . Several databases such as JASPAR , HOCOMOCO , CisBP  and Transfac , propose position frequency matrices (PFM, which can be transformed in PWMs) for hundred of TFs. These PWMs can be used to scan sequences and identify TFBSs using tools such as FIMO  or MOODS . However, while a PWM usually identifies thousands of potential binding sites for a given TF in the genome , ChIP-seq analyses have revealed that only a fraction of those sites are effectively bound . There may be different reasons for this discrepancy between predictions and experiments. First, PWMs implicitly assume that the positions within a TFBS independently contribute to binding affinity. Several approaches have thus been proposed to account for positional dependencies within the TFBS (see for example [31, 32]). Other studies have focused on the TFBS genomic environment, revealing that TFs positions of their core binding sites [33, 34]. Beyond the primary nucleotide sequence, structural constraints may also affect TF binding. For example, it is thought that TFs use DNA shape features to distinguish binding sites with similar DNA sequences [35, 36]. Some attempts have thus been made to integrate DNA shapes information with PWMs [37, 38]. Other studies have investigated the link between TF binding and epigenetic marks, showing that many TFs bind regions associated with specific histone marks . Similarly, ChIP-seq experiments also revealed that most TFBSs fall within highly accessible (i.e., nucleosome-depleted) DNA regions . Consequently, several studies have proposed to supplement PWM information with DNA accessibility data to identify the active TFBSs in a given cell type [41, 42, 43]. However, it remains unclear whether these chromatin states are a cause or a consequence of TF binding . Hence, while these approaches may be very informative for predicting TF binding, they should be used with caution if the goal is also to identify the DNA determinants of the binding. Besides, these approaches do not take into account TF combinations, which, as already discussed, may be important determinants of TF binding. For this reason, studying TF combinations through a TFBS prediction problem appears as an appealing approach.
It is important to note that beyond approaches based on known PWMs, several ab initio methods have also been proposed recently for predicting TFBSs from raw data sequences. Notably, deep learning approaches based on neural networks have proved to give higher prediction accuracy than simple PWM-based methods [45, 46]. However, ab initio methods, and particularly neural network approaches, are difficult to interpret (the inherent trade-off between accuracy and interpretability). Although some attempts have been made to post-analyze learned neural networks (see for example ), studying TF combinations and DNA determinants of TF binding from these models is not straightforward.
Hence, we devised a simple non ab initio strategy names TFcoop that predicts if a target TF binds a sequence of interest using two kinds of variables: i) the binding affinity (i.e. PWM affinity score) of the target TF as well as any other TF identified as cooperating with the target TF; and ii) the nucleotide composition of the sequence. TFcoop is based on a logistic model. The set of cooperating TFs and the model parameters are learned from ChIP-seq data of the target TF via LASSO penalization . Learning can be done using a moderate amount of data, which allows us to learn specific models for different types of regulatory sequences. Using ChIP-seq data from the ENCODE project, we applied TFcoop to investigate the TF combinations involved in the binding of 106 different TFs on 41 different cell types and in four different regulatory regions: promoters of mRNAs, lncRNAs and miRNAs, and enhancers [19, 20, 49, 50]. We first showed that the approach outperforms simple PWM methods and has surprisingly good accuracy, close to that of ab initio methods like DeepSea . We next assessed with independent experimental data that the cooperative TFs predicted by TFcoop actually bind the same regulatory sequences as the target TF. Then, we used TFcoop to analyze TF combinations in different cell types and regulatory regions. First, we show that TF combinations governing the binding of the target TF on promoters are similar for different cell-types but distinct in the case of enhancer binding. Second, for a given TF, we observe that TF combinations are different between promoters and enhancers, but similar for promoters of all gene classes (mRNAs, lncRNAs, and miRNAs). Analysis of the composition of TFs cooperating with the different targets show over-representation of pioneer TFs , especially in promoters, as well as binding sites with nucleotide composition similar to that of the target TF. We also observed that cooperating TFs are enriched for TFs whose binding is weakened by methylation . Lastly, our models can accurately distinguish promoters into classes associated with specific biological processes.
Given a target TF, the TFcoop method identifies the TFBS combination that is indicative of the TF presence in a regulatory region. We first considered the promoter region of all mRNAs (defined as the 1000bp centered around gene start). TFcoop is based on a logistic model that predicts the presence of the target TF in a particular promoter using two kinds of variables: PWM affinity scores and (di)nucleotide frequencies (see “Methods” section). For each promoter sequence, we computed the affinity score of the 638 JASPAR PWMs (redundant vertebrate collection for 519 different TFs), and the frequency of every mono- and dinucleotide in the promoter. These variables were then used to train a logistic model that aims to predict the outcome of a particular ChIP-seq experiment in mRNA promoters. Namely, every promoter sequence with a ChIP-seq peak is considered as a positive example, while the other sequences are considered as negative examples (see below). In the experiments below, we used 409 ChIP-seq datasets from ENCODE and different models. Each model targets one TF and one cell type. Given a ChIP-seq experiment, the learning process involves selecting the PWMs and (di)nucleotides that can help discriminate between positive and negative sequences (this is done by way of the LASSO penalization ), and estimate the model parameters that minimize prediction error. Note that the learning algorithm can select any predictive variable including the PWM of the target TF. See “Methods” section for more details on the data and logistic model. Note also that, while several classification approaches are available in the literature, all methods are not suitable for our problem. Because our aim is to identify TF combinations, only methods implementing a feature selection procedure are eligible. To this aim, LASSO penalization is often considered as a method of choice . An alternative would be to use classification trees, but this method is known to suffer from stability issues .
We used two different procedures for selecting the positive and negative sequences. Each procedure actually defines a different classification problem. In the first case, we kept all positive sequences (i.e. promoters overlapping a ChIP-seq peak in the considered ChIP-seq experiment), and randomly selected the same number of negative sequences among all sequences that do not overlap a ChIP-seq peak. In the second case, we used an additional dataset that measures gene expression in the same cell type as the ChIP-seq data. We then selected all positive sequences with non zero expression level and randomly selected the same number of negative sequences among all sequences that do not overlap a ChIP-seq peak but that have a similar expression level as the selected positive sequences. Hence, in this case (hereafter called the expression-controlled case), we learn a model that predicts the binding of a target TF in a promoter knowing that the corresponding gene is expressed. On the contrary, in the first case we learn a model that predicts the binding without knowledge about gene expression. The purpose of the expression-controlled case is to decipher TF combinations independently of the effect of epigenetic modifications that are linked to expression (e.g. DNA methylation and various histone marks). As all selected sequences are associated with expressed genes, the positives and negatives sequences are likely to be associated with the same epigenetic marks.
Next, we sought to assess the TF cooperations inferred by the models. If true, they should be apparent in the ChIP-seq experiments. Namely, if the PWM of TF B is among the selected variables for predicting the presence of TF A, then we should observe many common targets among the ChIP-seq experiments of TFs A and B. To test this, we first randomly selected one model for each different TF, and restricted our analyses to the PWMs associated with TFs with ENCODE ChIP-seq experiments. Then, for each model A, we measured the Jaccard index between promoters bound by TF A and promoters bound by a TF B whose PWM has been selected in model A (cases B = A were not considered), and we compared these scores to the same scores computed on TFs whose PWMs have not been selected in model A (see “Methods” section). The LASSO procedure allows us to rank the selected variables from the most predictive to the less predictive ones. We measured the Jaccard index for different cutoffs in this ranked list and observed that Jaccard indexes i) vary accordingly with the cutoff and ii) are always larger than Jaccard indexes computed for non-selected TFs (t-test p-values <1.e−16; see Fig. 1c). Hence, the inferred TF cooperations are supported by experimental data. For comparison purpose, we repeated the same analysis with TF pairs identified by Myšičková and Vingron (2012) and found very similar performance measures (see Fig. 1c).
Finally, we sought to take advantage of the relative redundancy of target TFs in the set of 409 ChIP-seq experiments to investigate the specificity of the learned models. Namely, we compared pairs of models learned from ChIP-seq experiments targeting (i) the same TF in the same cell-type, (ii) the same TF in different cell-types, (iii) different TFs in the same cell-type, and (iv) different TFs in different cell-types. In these analyses, we used the model learned on one ChIP-seq experiment A to predict the outcome of another ChIP-seq experiment B, and we compared the results to those obtained with the model directly learned on B. More precisely, we measured the difference of AUC between the model learned on A and applied on B and the model learned and applied on B. To avoid any effect driven by the over-representation of CTCF in ChIP-seq data, we randomly selected only 10 ChIP-seq experiments targeting this TF in these analyses. As shown in Fig. 1d and e, models learned on the same TF (whether or not on the same cell-type) have overall smaller AUC differences than models learned on different TFs. For sake of comparison, we also ran the same analysis on non-ENCODE ChIP-seq data targeting 10 different TFs (see “Methods” section). Namely, we used the models learned on the corresponding ENCODE data to predict the outcome of these 10 non-ENCODE data. Results are overall very similar to those obtained on ENCODE data (median AUC 0.83 on ENCODE data vs. 0.82 on non-ENCODE data; see Additional file 1: Figure S6).
We then analyzed cell and TF specificity more precisely. Cell specificity refers to the ability of a model learned on one TF and in one cell type to predict the outcome of the same TF in another cell type. Oppositely, TF specificity refers to the ability of a model learned on one TF in one cell type to predict the outcome of another TF in the same cell type. Cell and TF specificities were evaluated by the shift between the associated distributions of AUC differences in Fig. 1d: cell specificity was assessed by the shift between red and yellow distributions, while TF specificity was assessed by the shift between red and light blue distributions. We used a standard t-test to measure that shift. Low p-values indicate high distribution shifts (hence high cell/TF specificity), while high p-values indicate low shifts (hence low specificity). Our results indicate very low cell specificity (p-values 0.91 and 0.95 in the non-controlled and expression-controlled cases, respectively) and high TF specificity (1·10−61 and 3·10−83). The fact that the TF specificity is slightly higher in the expression-controlled case suggests that part of the TF combinations that help discriminate between bound and unbound sequences is common to several TFs in the non-controlled case. It is indeed known that the majority of ChIP-seq peaks are found in open and active promoters . Thus, most positive examples are associated with open chromatin marks. However, in the non-expression-controlled case, a large part of the negative examples are in closed chromatin and are therefore likely associated with other chromatin marks. As a result, in this case, TFcoop presumably also learns the TFBS signature that helps differentiate between these chromatin marks. Oppositely, in the expression-controlled case, the positive and negative examples have similar chromatin states, and TFcoop unveils the TFBS signature specific to the target TF. We can also observe that this renders the former problem slightly easier than the second one, as illustrated by the difference of TFcoop performances in Fig. 1a (t-test p-value 2.6e−18). Finally the low cell specificity means that the general rules governing TFBS combination in promoters do not dramatically change from one tissue to another. This is important in practice because it enables us to use a model learned on a specific ChIP-seq experiment to predict TBFSs of the same TF in another cell type.
Analysis of TFBS combinations in promoters
We next analyzed the different variables (PWM scores and (di)nucleotide frequencies) that were selected in the 409 learned models. Overall, 95% of the variables correspond to PWM scores. Although only 5% of the selected variables are (di)nucleotide frequencies, almost all models include at least one of these features. As mentioned earlier, the learning algorithm does not use any prior knowledge and can select the variables that best help predict the ChIP-seq experiment without necessarily selecting the PWM of the target TF. Our analysis shows that, for 75% of the models, at least one version of the target PWM was selected. Moreover, it is important to note that similar PWMs tend to have correlated scores. Hence, another PWM may be selected instead of the target. To overcome this bias, we also considered all PWMs similar to the target PWM. We used Pearson correlation between PWM scores in all promoters to measure similarity and set a threshold value of 0.75 to define the list of similar PWMs. With this threshold, 90% models include the target or a similar PWM. Analysis of the remaining 10% models shows that they often correspond to ChIP-seq experiments with low number of positive sequences (median number 955 vs. 2477 for all ChIP-seq experiments). This may be due either to technical problems, to lowly expressed TFs, or to TFs that rarely bind promoters.
Next we thought to investigate the contribution alternative PWMs may have on model performance. For this, we ran a whole new analysis using the non-redundant JASPAR PWM library (one PWM per TF, i.e. 519 PWMs). As shown in Additional file 1: Figure S5, results are slightly less accurate than with the complete (redundant) database, illustrating the fact that alternative motifs provide important information unveiled by TFcoop.
We further analyzed the most selected PWMs. To avoid any bias linked to the number of CTCF ChIP-seq experiments, we only considered 10 CTCF models that were randomly selected for the analyses. We ranked the PWMs by the number of models in which they appear, and look for enrichment of certain JASPAR structural families (bHLH, Zinc finger, …). A gene set enrichment analysis (GSEA, see “Methods” section)  shows that “tryptophan cluster factors” (FDR q-val <10−4), “C2H2 zinc finger factors” (FDR q-val <10−4) and “basic leucine zipper factors” (FDR q-val =2·10−3) are the most represented classes of PWMs selected in the models (Additional file 1: Figure S7). We then looked at the differences between models learned in the expression-controlled experiments and models learned in the non-controlled experiments. For each non-controlled model, we enumerated the variables that are selected in this model and not selected in the corresponding expression-controlled model. Several PWMs are over-represented in this list (see Additional file 1: Table S1).
We next evaluated the possibility of clustering the 409 learned models using the selected variables. As shown in Additional file 1: Figure S10, the models can be partitioned in a few different classes with a k-means algorithm (5 classes were used in this figure). Additional file 1: Figure S11 reports the most used variables in these different classes. We can first observe that, in agreement with our analysis of model specificity, the models associated with the same TF tend to cluster together. For example, the 4th class of our clustering (the blue one in Additional file 1: Figure S10) is exclusively composed of CTCF models. Note that we did not observe any enrichment for the classical TF structural families (bHLH, Zinc finger, …) in the different classes (data not shown). Actually, the clustering seems to be essentially driven by the nucleotide composition of the PWMs belonging to the models (see Additional file 1: Figure S12).
Pioneer TFs are thought to play an important role in transcription by binding to condensed chromatin and enhancing the recruitment of other TFs . As shown in Fig. 2b and by a GSEA analysis (Additional file 1: Figure S9), pioneer factors clearly are over-represented in the selected variables of the models, whereas they represent less than 14% of all TFs. These findings are in agreement with their activity: pioneer TFs occupy previously closed chromatin and, once bound, allow other TFs to bind nearby . Hence the binding of a given TF requires the prior binding of at least one pioneer TF. We also observed that TFs whose binding is weakened by methylation  are enriched in all models (Additional file 1: Figure S13). This result may explain how CpG methylation can negatively regulate the binding of a given TF in vivo while methylation of its specific binding site has a neutral or positive effect in vitro : regardless of the methylation status on its binding site, the binding of a TF can also be influenced in vivo by the sensitivity of its partners to CpG methylation.
TFBS combinations in lncRNA and pri-miRNA promoters
Next we sought to compare the models learned on mRNA promoters to the models learned on lncRNA and pri-miRNA promoters. For this, we interchanged the models learned on the same ChIP-seq experiment, i.e. we used the model learned on mRNA promoters to predict the outcome on lncRNA and pri-miRNA promoters. One striking fact illustrated by Fig. 3c and Additional file 1: Figure S15 is that models learned on mRNA promoters and those learned on lncRNA promoters are almost perfectly interchangeable. This means that the TFBS rules governing the binding of a specific TF in a promoter are similar for both types of genes. We obtained consistent results when we used the models learned on mRNAs to predict the ChIP-seq outcomes on pri-miRNA promoters (Fig. 3c and Additional file 1: Figure S15). Accuracy is even better than that obtained by models directly learned on pri-miRNA promoters, illustrating the fact that the poor performance achieved on pri-miRNA promoters likely results from the small number of learning examples available for these genes.
TFBS combinations in enhancers
We next applied the same approach on 38,554 enhancers defined by the FANTOM consortium . We used the same ChIP-seq experiments as for the promoters. All enhancer sequences overlapping a ChIP-seq peak in the considered ChIP-seq experiment were considered as positive examples. As for promoters, we used two strategies to select positives and negative examples: in a first case we did not apply any control on their expression, while in a second case, we used CAGE expression data in the different tissues to only select expressed enhancers.
However, analysis of model specificity reveals somewhat different results from that observed for promoters. Globally, models have good TF specificity: models learned on the same TF have more similar prediction accuracy than models learned on different TFs. However, in contrast to promoters, cell specificity is high in the non-controlled case (p-value 2·10−45; see peak shift in Fig. 4d), although much lower in the expression-controlled case (p-value 1.6·10−12). Additionally, TF specificity seems slightly higher in the expression-controlled case than in the non-controlled case (p-values 1.7·10−102 vs. 1.·10−114). This is in accordance with our hypothesis formulated for promoters, that part of the TF combinations learned by TFcoop in the non-controlled case actually differentiates between active and inactive chromatin marks. This also seems to indicate that these TF combinations are cell-type specific, while the remaining combinations are more general (as illustrated by the 1.6·10−12p-value measured on the expression-controlled case). Moreover, analysis of selected variables reveals that models learned without expression control involve much more variables than models learned with expression control (median numbers 18 vs. 11; t-test p-value ∼10−9). As a consequence, several variables are statistically more abundant in non-controlled models than in the cognate expression-controlled models (see Additional file 1: Table S1). Interestingly, among the four variables with the most important differences, three are dinucleotides CpG, TpC and ApT. This may indicate that part of the active/inactive chromatin marks is linked to the dinucleotide composition of the underlying sequence. This proposal is in line with findings revealing the existence of sequence-level instructions for chromatin modifications [45, 46, 56]. Moreover, a GSEA analysis shows that the PWMs with the strongest differential enrichments belong to the “three-zinc finger kruppel-related factors” (FDR q-val 1·10−2). As some of these factors, in particular KLF1 , are linked to chromatin remodeling, this enrichment supports the idea that TFcoop also identifies TF combinations linked to epigenetics. The fact that cell-type specificity is more apparent for enhancers than for promoters in the non expression-controlled case (2·10−45 for enhancers vs. 0.91 for promoters) is in accordance with the fact that, contrary to promoters, most enhancers are expressed in a cell-specific manner (Additional file 1: Figure S17 and ref. ).
Using TFcoop scores to describe regulatory sequences
Next, we used the TFcoop models to discriminate between mRNA promoters and enhancers. We randomly split the sets of promoters and enhancers in training and test sets, and learned a K-nearest neighbor (KNN) classifier for discriminating between promoter and enhancer sequences on the basis of scores of the TFcoop models learned on promoters. As above, we also used the classical PWM scores of the same 106 selected TFs and (di)nucleotide frequencies of the sequences. We resumed the procedure with a number of neighbors (K) varying between 1 and 20, and computed the number of errors obtained by each approach on the test set (Fig. 7b). Here again, TFcoop description outperforms other description methods, with an error rate around 2% for TFcoop vs. 15% and 25% for the other approaches. This result confirms the existence of DNA features distinguishing enhancers from mRNA promoters [21, 50] and identifies TF combinations as potent classifiers.
In this paper we proposed a method to identify TF combinations that can be predictive of the binding of a target TF. Our approach is based on a logistic model learned from ChIP-seq experiments on the target TF. Cross-validation study showed that the approach is effective and outperforms classical PWM based approaches on many TFs. It is important to note that TFcoop combinations do not necessarily reflect just cooperation, but also competition. For instance, a TF A competing with a TF B may be useful to predict the binding of B and would thus appear in the TF B model while A and B do not cooperate.
We distinguished two prediction problems associated with two situations, depending whether the aim is to predict binding in any promoter/enhancer or solely in expressed promoters/enhancers. For expressed promoters/enhancers, our experiments showed that the learned models have high TF specificity and quite low cell-type specificity. On the other hand, for the problem of expressed and not expressed promoters/enhancers binding, the learned models are less TF specific and more cell-type specific (especially for enhancers). These results are in accordance with a two-level model of gene regulation: (i) cell-type specific level that deposits specific chromatin marks on the genome, and (ii) non, or poorly, cell-type specific level that regulates TF binding in all DNA regions associated with appropriate marks.
An important property highlighted by our models is that rules governing TF combinations are very similar in the promoters of the three gene types analyzed (mRNA, pri-miRNA and lncRNA), but different between promoters and enhancers. Our results are in agreement with that of Andersson et al. , who showed that some motifs are enriched in enhancers (e.g. AP-1 or OCT4), while other are enriched in promoters (ELF1 or NRF1). We further confirmed these differences between promoter and enhancer sequences showing that scores produced by TFcoop models allow accurate classification between the two types of sequences. Our results thus argue for a prominent role of transcription factor binding as the fundamental determinant of regulatory activity able to distinguish enhancers and promoters . Furthermore, as promoters and enhancers produce different RNA molecules [21, 50], our results also suggest that the production of enhancer RNAs (eRNAs) on one hand, and that of mRNAs, lncRNAs, and miRNAs on the other hand, requires a specific and distinct subset of TFs.
Our approach could be improved in several ways. A quite straightforward improvement would be to use the DNAshape score developed by Mathelier et al.  instead of the classical PWM score. This could improve TFcoop accuracy for several TFs, especially for TFs such as CTCF for which TFcoop does not outperform classical PWM scoring. More profoundly, one drawback of TFcoop is that the logistic model enables us to learn only a single TF combination for each target TF. However, we can imagine that certain TFs may be associated with two or more different TF combinations depending on the promoter/enhancer they bind. A solution for this would be to learn a discrimination function based on several logistic models instead of a single one.
On the whole, studying TF combinations by the way of a TFBS prediction problem appears as a promising approach. We showed here that, despite its simplicity, the TFcoop method is accurate and allows identifying complex combinations on the basis of a single ChIP-seq experiment. We used it on ENCODE and FANTOM data and identified important properties of TF combinations in human. Specifically we showed that combinations governing TF binding on enhancers are more cell-type specific than that governing binding in promoters. Moreover, for a given TF and cell type, TF combinations are different between promoters and enhancers, but similar for promoters of mRNAs, lncRNAs and pri-miRNAs. Finally, analysis of the TFs cooperating with the different targets show over-representation of pioneer TFs and a clear preference for TFs with binding motif composition similar to that of the target.
Promoter, enhancer, long non-coding RNA and microRNA sequences
We predicted TF binding in both human promoters and enhancers. For promoters, sequences spanning ±500 bp around starts (i.e. most upstream TSS) of protein-coding genes, long non-coding RNAs and microRNAs were considered. Starts of coding and lncRNA genes were obtained from the hg19 FANTOM CAGE Associated Transcriptome (CAT) annotation [19, 49]. Starts of microRNA genes (primary microRNAs, pri-miRNAs) were from . For enhancers, sequences spanning ±500 bp around the mid-positions of FANTOM-defined enhancers  were used. Lastly, our sequence datasets are composed of 20,845 protein coding genes, 1250 pri-microRNAs, 23,887 lncRNAs, and 38,553 enhancer sequences.
Nucleotide and dinucleotide features
For each of these sequences, we computed nucleotide and dinucleotide relative frequencies as the occurrence number in the sequence divided by sequence length. Frequencies were computed in accordance with the rule of DNA reverse complement. For nucleotides, we computed the frequency of A/T and G/C. Similarly, frequencies of reverse complement dinucleotides (e.g. ApG and CpT) were computed together. This results in a total of 12 features (2 nucleotides and 10 dinucleotides).
We used vertebrate TF PFMs from JASPAR , including all existing versions of each PFM, resulting in a set of 638 PFMs with 119 alternative versions (i.e. 519 different TFs). We also used the non-redundant version of the JASPAR vertebrate database (519 PFMs) and the two alternative PFM libraries CisBP  and HOCOMOCO . CisBP is a meta-library gathering PFMs from various sources, which contains up to 972 human PFMs (http://cisbp.ccbr.utoronto.ca). We collected, for each TF, all directly determined motifs indicated in TF_Information.txt of the Homo sapiens archive. To avoid redundancy, we selected only one model for each TF by arbitrarily selecting the longest PWM. Moreover we also excluded all TRANSFAC PWMs that are not publicly available (this reduces the set of TFs associated with a PFM to 625). Note that CisBP is built on JASPAR and HOCOMOCO 2014 versions. For HOCOMOCO, we used the human PCM v11 full collection of the core mononucleotide models (771 PCMs corresponding to 680 TFs). PCMs were converted into PFMs and PFMs were further transformed into PWMs as described in Wasserman and Sandelin . PWM scores used by TFcoop for a given sequence were computed as described in , keeping the maximal score obtained in any position of the sequence. Namely, each PWM was used to scan the entire sequence and score each position, and the maximal score was used as potential predictive feature by TFcoop.
We collected ChIP-seq data from the ENCODE project  for human immortalized cell lines, tissues, and primary cells. Experiments were selected when the targeted TF were identified by a PWM in JASPAR. Thus we studied 409 ChIP-seq experiments for 106 distinct TFs and 41 different cell types. The most represented TF is CTCF with 69 experiments, while 88% of the experiments are designed from immortalized cell lines (mainly GM12878, HepG2 and K562). The detailed list of all used experiments is given in Supplementary materials. For each ChIP-seq experiment, regulatory sequences were classified as positive or negative for the corresponding ChIP targeted TF. We used Bedtools v2.25.0  to detect intersection between ChIP-seq binding sites and regulatory sequences (both mapped to the hg19 genome). Each sequence that intersects at least one ChIP-seq binding region was classified as a positive sequence. The remaining sequences formed a negative set. The number of positive sequences varies greatly between experiments and sequence types. Mean and standard deviation numbers of positive sequences are respectively 2661(±1997) for mRNAs, 1699(±1151) for lncRNAs, 216(±176) for microRNAs, and 1516(±1214) for enhancers. For sake of comparison, we also used non-ENCODE ChIP-seq data collected from the Cistrome database  (http://cistrome.org). Note that Cistrome provides hg38 ChIP-seq peaks, not narrow peaks as provided by ENCODE. We collected data corresponding to GSM2224586 (ELF1), GSM1056931 (ETS1), GSM894076 (MAX), GSM1423725 (MYC), GSM1698353 (USF1), GSM1614036 (JUN), GSM2042914 (JUND), GSM1917774 (ATF3), GSM1708340 (YY1) and GSM1334010 (ZBTB33). The bed files were liftovered into hg19 coordinates using UCSC liftover tool.
To control the effect of expression in our analyses, we used ENCODE CAGE data restricted to 41 cell lines. The expression per cell line was calculated as the mean of the expression observed in all corresponding replicates. For microRNAs, we used the small RNA-seq ENCODE expression data collected for 3043 mature microRNAs in 37 cell lines (corresponding to 403 ChIP-seq experiments). The expression of microRNA genes (i.e. pri-microRNAs) was calculated as the sum of the expression of the corresponding mature microRNAs.
We propose a logistic model to predict the regulatory sequences bound by a specific TF. Contrary to classical approaches, we not only consider the score of the PWM associated with the target TF, but also the scores of all other available PWMs. The main idea behind this is to unveil the TF interactions required for effective binding of the target TF. We also integrate in our model the nucleotide and dinucleotide compositions of the sequences, as the environment of TFBSs are thought to play major role in binding affinity [33, 34].
To perform variable selection (i.e. identifying cooperating TFs), we used the LASSO regression minimising the prediction error within a constraint over l1-norm of β . The weight of the LASSO penalty is chosen by cross-validation by minimising the prediction error with the R package glmnet  (see below).
TFcoop models were trained with the cv.glmnet function of the glmnet package, with options nfolds=10. This runs a 10-fold cross validation. In each validation loop, 90% of sequences are used to learn the β parameters and the remaining 10% are used to evaluate the predictive performance of the model. We set the option keep=TRUE to memorize the predictions achieved during cross-validation. These predictions were then systematically used in the AUC estimations to avoid over-fitting. There are two different situations here: when computing the AUC of a model trained on the same ChIP-seq data (for example in Fig. 1a) only the cross-validated predictions were used. However, when computing the AUC of a model trained on a different ChIP-seq data (for example in Fig. 1d), all test sequences do not belong to the training data (because all negative sequences were not used for training). In this case, we used the learned model to predict the outcome of the sequences that do not belong to the training data, and we used the prediction obtained during cross-validation by cv.glmnet for the other sequences.
We compared the predictive accuracy of our model to three other approaches.
Best hit approach The traditional way to identify TF binding sites consists in scanning a sequence and scoring the corresponding PWM at each position. Positions with a score above a predefined threshold are considered as potential TFBS. A sequence is then considered as bound if it contains at least one potential TFBS.
TRAP score An alternative approach proposed by Roider et al.  is based on a biophysically inspired model that estimates the number of bound TF molecules for a given sequence. In this model, the whole sequence is considered to define a global affinity measure, which enables us to detect low affinity bindings. We use the R package tRap  to compute the affinity score of the 638 PWMs for all sequences. As proposed in , we use default values for the two parameters (R0(width), λ=0.7).
DNA shape In addition to PWMs, Mathelier et al.  considered 4 DNA shapes to increase binding site identification: helix twist, minor groove width, propeller twist, and DNA roll. The 2nd order values of these DNA shapes are also used to capture dependencies between adjacent positions. Thus, each sequence is characterized by the best hit score for a given PWM plus the 1st and 2nd DNA shape order values at the best hit position. The approach based on gradient boosting classifier requires a first training step with foreground (bound) and background (unbound) sequences to learn classification rules. Then the classifier is applied to the set of test sequences. We used the same 10-fold cross-validation scheme that we used in our approach. We applied two modifications to speed-up the method, which was designed for smaller sequences. First, in the PWM optimization step of the training phase, we reduced the sequences to ±50 bp around the position with highest ChIP-Seq peak for positive sequences and to ±50 bp around a random position for negative sequences. Second, after this first step we also reduced sequences used to train and test the classifiers to ±50 bp around the position for which the (optimized) PWM gets the best score.
DeepSEA Zhou and Troyanskaya  proposed a deep learning approach for predicting the binding of chromatin proteins and histone marks from DNA sequences with single-nucleotide sensitivity. Their deep convolutional network takes 1000 bp genomic sequences as input and predicts the states associated with several chromatin marks in different tissues. We used the predictions provided by DeepSEA server (http://deepsea.princeton.edu/). Namely, coordinates of the analyzed promoter and enhancer sequences were provided to the server, and the predictions associated with each sequence were retrieved. Only the predictions related to the ChIP-seq data we used in our analyses were considered (i.e. 214 ChIP-seq data in total).
Model clustering We used the kmeans procedure implemented in R to classify the 2×409 models trained on promoters and enhancers. Each model was described by a Boolean vector describing the selected/non-selected variables (dinucleotides and PWMs). Different numbers of classes from 1 to 10 were tested. For each number, the kmeans was run 200 times and the best classification (according to the statistic optimized by the kmeans) was returned. To choose the “best” number of classes, we used a very simple procedure. We plotted the kmeans statistics vs. their corresponding class numbers and selected what can be considered as the best trade-off between modelling and complexity (see Additional file 1: Figure S10).
GSEA analyzes We used the GSEA program from the Broad Institute  to assess enrichment of specific annotations among the PWMs selected in our model. Different experiments have been done. In one experiment we ranked the PWMs by the number of models in which they appear (in promoters, and then in enhancers), and look for enrichment of certain JASPAR structural families (PWM annotations provided by JASPAR) or of pioneer factors (see below) in the PWMs at the top of the list. In other experiments, PWMs were ranked by their difference of utilization between models learned in the expression controlled experiments and non-expression controlled experiment (see Additional file 1: Table S1) or between promoter models and enhancer models (Additional file 1: Table S2) and we looked for enrichment of certain JASPAR structural families in the top PWMs of these lists.
Pioneer factors We used the classification of  to distinguish pioneers, settlers and migrants TFs. ’Pioneer’ TFs occupy previously closed chromatin and, once bound, allow other TFs to bind nearby. ’Settler’ designate TFs whose binding is predominantly dependent on the openness of chromatin at their motifs. ’Migrants’ bind only sporadically even when chromatin at their motifs is open.
We thank Anthony Mathelier and Wyeth W. Wasserman for insightful discussions and suggestions. We are indebted to researchers around the globe who generated experimental data and made them freely available.
This work was supported by funding from CNRS, Plan d’Investissement d’Avenir #ANR-11-BINF-0002 Institut de Biologie Computationnelle (young investigator grant to C-H.L. and post-doctoral fellowship to J.V.), Labex NUMEV (post-doctoral fellowship to J.V.), INSERM-ITMO Cancer project “LIONS” BIO2015-04, and CNRS International Associated Laboratory “miREGEN”.
Availability of data and materials
The R code for learning a TFcoop model from a ChIP-seq experiment is available in an R Markdown file at address https://gite.lirmm.fr/brehelin/TFcoop. This file also provides the R scripts for reproducing the main experiments described in the paper. The different models learned on mRNA, lncRNA, miRNA promoters and enhancers are available as R object (.RData) at the same address.
Conceptualization: LB, SL, CHL, JV; Funding acquisition: LB, CHL; Methodology: LB, SL, JV, CHL; Software: JV, LB, OC; Analysis and data interpretation: LB, SL, JV, CHL, OC; Supervision: LB; Writing: LB, CHL, JV. All authors have read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 3.Jolma A, Kivioja T, Toivonen J, Cheng L, Wei G, Enge M, Taipale M, Vaquerizas JM, Yan J, Sillanpää MJ, Bonke M, Palin K, Talukder S, Hughes TR, Luscombe NM, Ukkonen E, Taipale J. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 2010; 20(6):861–73. https://doi.org/10.1101/gr.100552.109. 00245.PubMedPubMedCentralCrossRefGoogle Scholar
- 9.Sherwood RI, Hashimoto T, O’Donnell CW, Lewis S, Barkal AA, van Hoff JP, Karun V, Jaakkola T, Gifford DK. Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nat Biotechnol. 2014; 32(2):171–8. https://doi.org/10.1038/nbt.2798.PubMedPubMedCentralCrossRefGoogle Scholar
- 12.Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012; 22(9):1798–812.PubMedPubMedCentralCrossRefGoogle Scholar
- 19.Hon C-C, Ramilowski JA, Harshbarger J, Bertin N, Rackham OJL, Gough J, Denisenko E, Schmeier S, Poulsen TM, Severin J, Lizio M, Kawaji H, Kasukawa T, Itoh M, Burroughs AM, Noma S, Djebali S, Alam T, Medvedeva YA, Testa AC, Lipovich L, Yip C-W, Abugessaisa I, Mendez M, Hasegawa A, Tang D, Lassmann T, Heutink P, Babina M, Wells CA, Kojima S, Nakamura Y, Suzuki H, Daub CO, de Hoon MJL, Arner E, Hayashizaki Y, Carninci P, Forrest ARR. An atlas of human long non-coding RNAs with accurate 5’ ends. Nature. 2017; 543(7644):199–204. https://doi.org/10.1038/nature21374. 00009.PubMedCrossRefGoogle Scholar
- 20.de Rie D, Abugessaisa I, Alam T, Arner E, Arner P, Ashoor H, Astrom G, Babina M, Bertin N, Burroughs AM, Carlisle AJ, Daub CO, Detmar M, Deviatiiarov R, Fort A, Gebhard C, Goldowitz D, Guhl S, Ha TJ, Harshbarger J, Hasegawa A, Hashimoto K, Herlyn M, Heutink P, Hitchens KJ, Hon CC, Huang E, Ishizu Y, Kai C, Kasukawa T, Klinken P, Lassmann T, Lecellier CH, Lee W, Lizio M, Makeev V, Mathelier A, Medvedeva YA, Mejhert N, Mungall CJ, Noma S, Ohshima M, Okada-Hatakeyama M, Persson H, Rizzu P, Roudnicky F, S?trom P, Sato H, Severin J, Shin JW, Swoboda RK, Tarui H, Toyoda H, Vitting-Seerup K, Winteringham L, Yamaguchi Y, Yasuzawa K, Yoneda M, Yumoto N, Zabierowski S, Zhang PG, Wells CA, Summers KM, Kawaji H, Sandelin A, Rehli M, Hayashizaki Y, Carninci P, Forrest ARR, de Hoon MJL. An integrated expression atlas of miRNAs and their promoters in human and mouse. Nat Biotechnol. 2017; 35(9):872–8.Google Scholar
- 23.Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, Shyr C, Tan G, Worsley-Hunt R, Zhang AW, Parcy F, Lenhard B, Sandelin A, Wasserman WW. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016; 44(D1):110–15.CrossRefGoogle Scholar
- 24.Kulakovskiy IV, Vorontsov IE, Yevshin IS, Soboleva AV, Kasianov AS, Ashoor H, Ba-alawi W, Bajic VB, Medvedeva YA, Kolpakov FA, Makeev VJ. HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016; 44(D1):116–25. https://doi.org/10.1093/nar/gkv1249.CrossRefGoogle Scholar
- 30.Kaplan T, Li X-Y, Sabo PJ, Thomas S, Stamatoyannopoulos JA, Biggin MD, Eisen MB. Quantitative models of the mechanisms that control genome-wide patterns of transcription factor binding during early Drosophila development. PLoS Genet. 2011; 7(2):1001290. https://doi.org/10.1371/journal.pgen.1001290.CrossRefGoogle Scholar
- 33.Levo M, Zalckvar E, Sharon E, Machado ACD, Kalma Y, Lotam-Pompan M, Weinberger A, Yakhini Z, Rohs R, Segal E. Unraveling determinants of transcription factor binding outside the core binding site. Genome Res. 2015; 25(7):1018–29. https://doi.org/10.1101/gr.185033.114.PubMedPubMedCentralCrossRefGoogle Scholar
- 40.Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R, Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T, Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee B-K, Lee K, London D, Lotakis D, Neph S, Neri F, Nguyen ED, Qu H, Reynolds AP, Roach V, Safi A, Sanchez ME, Sanyal A, Shafer A, Simon JM, Song L, Vong S, Weaver M, Yan Y, Zhang Z, Zhang Z, Lenhard B, Tewari M, Dorschner MO, Hansen RS, Navas PA, Stamatoyannopoulos G, Iyer VR, Lieb JD, Sunyaev SR, Akey JM, Sabo PJ, Kaul R, Furey TS, Dekker J, Crawford GE, Stamatoyannopoulos JA. The accessible chromatin landscape of the human genome. Nature. 2012; 489(7414):75–82. https://doi.org/10.1038/nature11232.PubMedPubMedCentralCrossRefGoogle Scholar
- 43.Schmidt F, Gasparoni N, Gasparoni G, Gianmoena K, Cadenas C, Polansky JK, Ebert P, Nordström K, Barann M, Sinha A, Fröhler S, Xiong J, Dehghani Amirabad A, Behjati Ardakani F, Hutter B, Zipprich G, Felder B, Eils J, Brors B, Chen W, Hengstler JG, Ham ann A, Lengauer T, Rosenstiel P, Walter J, Schulz MH. Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction. Nucleic Acids Res. 2017; 45(1):54–66. https://doi.org/10.1093/nar/gkw1061.PubMedCrossRefGoogle Scholar
- 44.Huminiecki Ł, Horbańczuk J. Can We Predict Gene Expression by Understanding Proximal Promoter Architecture?Trends Biotechnol. 2017; 0(0). https://doi.org/10.1016/j.tibtech.2017.03.007.
- 47.Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685. 2017.Google Scholar
- 48.Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996; 58(1):267–88.Google Scholar
- 49.Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, Haberle V, Lassmann T, Kulakovskiy IV, Lizio M, Itoh M, Andersson R, Mungall CJ, Meehan TF, Schmeier S, Bertin N, J?rgensen M, Dimont E, Arner E, Schmidl C, Schaefer U, Medvedeva YA, Plessy C, Vitezic M, Severin J, Semple C, Ishizu Y, Young RS, Francescatto M, Alam I, Albanese D, Altschuler GM, Arakawa T, Archer JA, Arner P, Babina M, Rennie S, Balwierz PJ, Beckhouse AG, Pradhan-Bhatt S, Blake JA, Blumenthal A, Bodega B, Bonetti A, Briggs J, Brombacher F, Burroughs AM, Califano A, Cannistraci CV, Carbajo D, Chen Y, Chierici M, Ciani Y, Clevers HC, Dalla E, Davis CA, Detmar M, Diehl AD, Dohi T, Drabl?s F, Edge AS, Edinger M, Ekwall K, Endoh M, Enomoto H, Fagiolini M, Fairbairn L, Fang H, Farach-Carson MC, Faulkner GJ, Favorov AV, Fisher ME, Frith MC, Fujita R, Fukuda S, Furlanello C, Furino M, Furusawa J, Geijtenbeek TB, Gibson AP, Gingeras T, Goldowitz D, Gough J, Guhl S, Guler R, Gustincich S, Ha TJ, Hamaguchi M, Hara M, Harbers M, Harshbarger J, Hasegawa A, Hasegawa Y, Hashimoto T, Herlyn M, Hitchens KJ, Ho Sui SJ, Hofmann OM, Hoof I, Hori F, Huminiecki L, Iida K, Ikawa T, Jankovic BR, Jia H, Joshi A, Jurman G, Kaczkowski B, Kai C, Kaida K, Kaiho A, Kajiyama K, Kanamori-Katayama M, Kasianov AS, Kasukawa T, Katayama S, Kato S, Kawaguchi S, Kawamoto H, Kawamura YI, Kawashima T, Kempfle JS, Kenna TJ, Kere J, Khachigian LM, Kitamura T, Klinken SP, Knox AJ, Kojima M, Kojima S, Kondo N, Koseki H, Koyasu S, Krampitz S, Kubosaki A, Kwon AT, Laros JF, Lee W, Lennartsson A, Li K, Lilje B, Lipovich L, Mackay-Sim A, Manabe R, Mar JC, Marchand B, Mathelier A, Mejhert N, Meynert A, Mizuno Y, de Lima Morais DA, Morikawa H, Morimoto M, Moro K, Motakis E, Motohashi H, Mummery CL, Murata M, Nagao-Sato S, Nakachi Y, Nakahara F, Nakamura T, Nakamura Y, Nakazato K, van Nimwegen E, Ninomiya N, Nishiyori H, Noma S, Noma S, Noazaki T, Ogishima S, Ohkura N, Ohimiya H, Ohno H, Ohshima M, Okada-Hatakeyama M, Okazaki Y, Orlando V, Ovchinnikov DA, Pain A, Passier R, Patrikakis M, Persson H, Piazza S, Prendergast JG, Rackham OJ, Ramilowski JA, Rashid M, Ravasi T, Rizzu P, Roncador M, Roy S, Rye MB, Saijyo E, Sajantila A, Saka A, Sakaguchi S, Sakai M, Sato H, Savvi S, Saxena A, Schneider C, Schultes EA, Schulze-Tanzil GG, Schwegmann A, Sengstag T, Sheng G, Shimoji H, Shimoni Y, Shin JW, Simon C, Sugiyama D, Sugiyama T, Suzuki M, Suzuki N, Swoboda RK, ’t Hoen PA, Tagami M, Takahashi N, Takai J, Tanaka H, Tatsukawa H, Tatum Z, Thompson M, Toyodo H, Toyoda T, Valen E, van de Wetering M, van den Berg LM, Verado R, Vijayan D, Vorontsov IE, Wasserman WW, Watanabe S, Wells CA, Winteringham LN, Wolvetang E, Wood EJ, Yamaguchi Y, Yamamoto M, Yoneda M, Yonekura Y, Yoshida S, Zabierowski SE, Zhang PG, Zhao X, Zucchelli S, Summers KM, Suzuki H, Daub CO, Kawai J, Heutink P, Hide W, Freeman TC, Lenhard B, et al.A promoter-level mammalian expression atlas. Nature. 2014; 507(7493):462–70.Google Scholar
- 50.Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, Ntini E, Arner E, Valen E, Li K, Schwarzfischer L, Glatz D, Raithel J, Lilje B, Rapin N, Bagger FO, J?rgensen M, Andersen PR, Bertin N, Rackham O, Burroughs AM, Baillie JK, Ishizu Y, Shimizu Y, Furuhata E, Maeda S, Negishi Y, Mungall CJ, Meehan TF, Lassmann T, Itoh M, Kawaji H, Kondo N, Kawai J, Lennartsson A, Daub CO, Heutink P, Hume DA, Jensen TH, Suzuki H, Hayashizaki Y, Muller F, Forrest AR, Carninci P, Rehli M, Sandelin A, Forrest AR, Kawaji H, Rehli M, Baillie JK, de Hoon MJ, Haberle V, Lassmann T, Kulakovskiy IV, Lizio M, Itoh M, Andersson R, Mungall CJ, Meehan TF, Schmeier S, Bertin N, J?rgensen M, Dimont E, Arner E, Schmid C, Schaefer U, Medvedeva YA, Plessy C, Vitezic M, Severin J, Semple CA, Ishizu Y, Young RS, Francescatto M, Alam I, Albanese D, Altschuler GM, Arakawa T, Archer JA, Arner P, Babina M, Rennie S, Balwierz PJ, Beckhouse AG, Pradhan-Bhatt S, Blake JA, Blumenthal A, Bodega B, Bonetti A, Briggs J, Brombacher F, Burroughs AM, Califano A, Cannistracti CV, Carbajo D, Chen Y, Chierici M, Ciani Y, Clevers HC, Dalla E, Davis CA, Detmar M, Diehl AD, Dohi T, Drabl?s F, Edge AS, Edinger M, Ekwall K, Endoh M, Enomoto H, Fagiolini M, Fairbairn L, Fang H, Farach-Carson MC, Faulkner GJ, Favorov AV, Fisher ME, Frith MC, Fujita R, Fukuda S, Furlanello C, Furuno M, Furusawa J, Geijtenbeek TB, Gibson AP, Gingeras T, Goldowitz D, Gough J, Guhl S, Guler R, Gustincich S, Ha TJ, Hamaguchi M, Hara M, Harbers M, Harshbarger J, Hasegawa A, Hasegawa Y, Hashimoto T, Herlyn M, Hitchens KJ, Ho Sui SJ, Hofman OM, Hoof I, Hori F, Huminiecki L, Iida K, Ikawa T, Jankovic BR, Jia H, Joshi A, Jurman G, Kaczkowski B, Kai C, Kaida K, Kaiho A, Kajiyama K, Kanamori-Katayama M, Kasianov AS, Kasukawa T, Katayama S, Kato S, Kawaguchi S, Kawamoto H, Kawamura YI, Kawashima T, Kempfle JS, Kenna TJ, Kere J, Khachigian LM, Kitamura T, Klinken SP, Knox AJ, Kojima M, Kojima S, Kondo N, Koseki H, Koyasu S, Krampitz S, Kubosaki A, Kwon AT, Laros JF, Lee W, Lennartsson A, Li K, Lilje B, Lipovich L, Mackay-Sim A, Manabe R, Mar JC, Marchand B, Mathelier A, Mejhert N, Meynert A, Mizuno Y, de Lima Morais DA, Morikawa H, Morimoto M, Moro K, Motakis E, Motohashi H, Mummery CL, Murata M, Nagao-Sato S, Nakachi Y, Nakahara F, Nakamura T, Nakamura Y, Nakazato K, van Nimwegen E, Ninomiya N, Nishiyori H, Noma S, Nozaki T, Ogishima S, Ohkura N, Ohmiya H, Ohno H, Onshima M, Okada-Hatakeyama M, Okazaki Y, Orlando V, Ovchinnikov DA, Pain A, Passier R, Patrikakis M, Persson H, Piazza S, Prendergast JG, Rackham OJ, Ramilowski JA, Rashid M, Ravasi T, Rizzu P, Roncador M, Roy S, Rye MB, Saijyo E, Sajantila A, Saka A, Sakaguchi S, Sakai M, Sato H, Satoh H, Savvi S, Saxena A, Schneider C, Schultes EA, Schultz-Tanzil GG, Schwegmann A, et al.An atlas of active enhancers across human cell types and tissues. Nature. 2014; 507(7493):455–61.Google Scholar
- 51.Yin Y, Morgunova E, Jolma A, Kaasinen E, Sahu B, Khund-Sayeed S, Das PK, Kivioja T, Dave K, Zhong F, Nitta KR, Taipale M, Popov A, Ginno PA, Domcke S, Yan J, Schubeler D, Vinson C, Taipale J. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science. 2017; 356(6337). https://doi.org/10.1126/science.aaj2239.
- 52.Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006; 7(Nov):2541–63.Google Scholar
- 55.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005; 102(43):15545–50. https://doi.org/10.1073/pnas.0506580102.PubMedPubMedCentralCrossRefGoogle Scholar
- 59.Dao LTM, Galindo-Albarran AO, Castro-Mondragon JA, Andrieu-Soler C, Medina-Rivera A, Souaid C, Charbonnier G, Griffon A, Vanhille L, Stephen T, Alomairi J, Martin D, Torres M, Fernandez N, Soler E, van Helden J, Puthier D, Spicuglia S. Genome-wide characterization of mammalian promoters with distal enhancer functions. Nat Genet. 2017; 49(7):1073–1081. https://doi.org/10.1038/ng.3884. Epub 2017 Jun 5.PubMedCrossRefGoogle Scholar
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.