De novo prediction of cis-regulatory elements and modules through integrative analysis of a large number of ChIP datasets
- 2.6k Downloads
In eukaryotes, transcriptional regulation is usually mediated by interactions of multiple transcription factors (TFs) with their respective specific cis-regulatory elements (CREs) in the so-called cis-regulatory modules (CRMs) in DNA. Although the knowledge of CREs and CRMs in a genome is crucial to elucidate gene regulatory networks and understand many important biological phenomena, little is known about the CREs and CRMs in most eukaryotic genomes due to the difficulty to characterize them by either computational or traditional experimental methods. However, the exponentially increasing number of TF binding location data produced by the recent wide adaptation of chromatin immunoprecipitation coupled with microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) technologies has provided an unprecedented opportunity to identify CRMs and CREs in genomes. Nonetheless, how to effectively mine these large volumes of ChIP data to identify CREs and CRMs at nucleotide resolution is a highly challenging task.
We have developed a novel graph-theoretic based algorithm DePCRM for genome-wide de novo predictions of CREs and CRMs using a large number of ChIP datasets. DePCRM predicts CREs and CRMs by identifying overrepresented combinatorial CRE motif patterns in multiple ChIP datasets in an effective way. When applied to 168 ChIP datasets of 56 TFs from D. melanogaster, DePCRM identified 184 and 746 overrepresented CRE motifs and their combinatorial patterns, respectively, and predicted a total of 115,932 CRMs in the genome. The predictions recover 77.9% of known CRMs in the datasets and 89.3% of known CRMs containing at least one predicted CRE. We found that the putative CRMs as well as CREs as a whole in a CRM are more conserved than randomly selected sequences.
Our results suggest that the CRMs predicted by DePCRM are highly likely to be functional. Our algorithm is the first of its kind for de novo genome-wide prediction of CREs and CRMs using larger number of transcription factor ChIP datasets. The algorithm and predictions will hopefully facilitate the elucidation of gene regulatory networks in eukaryotes. All the predicted CREs, CRMs, and their target genes are available at http://bioinfo.uncc.edu/mniu/pcrms/www/.
Keywordscis-regulatory elements cis-regulatory modules ChIP-chip ChIP-seq Drosophila melanogaster
Co-occurring pair clusters
Markov chain clustering
Position specific weight matrixes
Since the completion of sequencing the first metazoan genomes in 1998 , more than 311 important metazoan and plant genomes have been sequenced thus far , and enormous efforts have been made to understand how biological functions and diseases of these organisms including the humans can be explained by the genetic information stored in the genome sequences. Although significant progress has been made in the past 16 years, we are still far from the goal of understanding the biology of metazoans and plants solely from their genome sequences . In fact, it turns out that interpreting a genome is more difficult and challenging than originally thought when a few eukaryotic genomes including the human genome were first released [3, 4]. With this recognition, the community has taken a more realistic approach by first identifying all the functional sequence elements in the genomes [5, 6, 7]. These functional elements include transcribed sequences as well as transcriptional control elements, epigenetic features, and regulatory elements acting at the RNA level post-transcriptionally. In principle, while the transcribed sequences specify the potential part list in the cells in an organism, including proteins, various types of RNAs and metabolites, the transcriptional control elements including promoters, enhancers, silencers and insulators together with epigenetic remodeling machineries, determine which protein- or RNA-specifying sequences should be transcribed in each cell during development and under various physiological conditions, thereby specifying the cell’s type during development and specific physiological functions, as it is the dynamic interactions of these components in a cell that determine the cell’s type and specific physiological functions . Once these functional elements are at least partially known, then we can move toward to the next step to identify dynamic interactions among the functional sequence elements and their products of proteins, RNAs and metabolites in different cell types in the entire life of the organism.
In the past we have gained a good understanding of transcribed sequences, particularly protein-coding sequences in numerous sequenced eukaryotic genomes thanks to the development of powerful computational and experimental methods for their characterization . However, we have had only very limited understanding of transcriptional control elements, particularly promoters, enhancers and silencers in virtually all sequenced large eukaryotic genomes, even though these elements are as important as the transcribed sequences for the functions of an organism [10, 11, 12]. More specifically, promoters, enhancers and silencers are clusters of closely located cis-regulatory elements (CREs) that are recognized by specific transcription factors (TFs) . Thus, a CRE is also called a TF binding site (In this paper, we will refer to a set of similar CREs recognized by the same TF as a motif). These clusters of CREs are also called cis-regulatory modules (CRMs) . The difficulty to identify CREs and CRMs either computationally or experimentally is due mainly to their short and degenerate nature while they mainly reside in very long intergenic or intronic background sequences . To further confound the problem, they can be very far away from the target genes or even can be located on a different chromosome , making their characterization extremely difficult by computational methods such as comparative genomics approaches, although there are successful examples, in particular for developmental enhancers that tend to be more conserved [16, 17].
However, in the past a few years, the development of a plethora of next-generation sequencing (NGS)-based high throughput techniques has largely changed the way to characterize CREs or even CRMs genome-wide in large eukaryotic genomes. These techniques include ChIP-chip and ChIP-seq for locating CREs of a TF [18, 19, 20] and various chromatin modification marks , DNase-seq [22, 23, 24] and FAIRE-seq  for locating free nucleosome regions which tend to coincide with active CRMs, and Hi-C for measuring the physical proximity of linearly distal DNA segments [25, 26]. While a single epigenetic dataset derived from DNase-seq [22, 23, 24], FAIRE-seq  or enhancer mark ChIP-seq potentially contains location information of all CRMs active in a cell or tissue type, CREs and CRMs for specific TFs cannot be easily identified in such a dataset, as it lumps all CREs and CRMs active in the cell or tissue type. In contrast, a TF ChIP-seq dataset is highly enriched for the CREs of the TF, thus they can be potentially identified at single nucleotide resolution in a cell or tissue or type. However, the sequenced potential binding regions in a TF ChIP-seq dataset can be still much longer than the CREs of the ChIP-ed TF, thus peak-calling algorithms and tools have been developed to identify the binding peaks in the potential binding regions. Even though the existing peak-calling algorithms can narrow down CREs of a ChIP-ed TF to a certain regions, typically from a few hundred to a few thousand base pairs (bp) , they are still much longer than the typical lengths of CREs, which are typically 6 ~ 16 bp long. Hence, the actual locations of CREs need to be identified by a motif-finding tool [28, 29]. Although a few new motif-finders have been developed to analyze large sequence sets from ChIP-seq experiments, such as seeder , Trawler [30, 31], ChIPMunk , HMS , CMF , STEME , DREME , DECOD , RSAT , and POSMO , they are typically used to find the CREs of a ChIP-ed TF in a short region of sequences (~200 bp) around the binding peak summits in order to reduce the searching space and increase prediction specificity in trading of sensitivity. Some of these tools [33, 39] use the locations of binding peaks to help find the CREs of a ChIP-ed TF. Thus only CREs of the ChIP-ed TF are returned by these tools. However, CREs in higher eukaryotes rarely work alone, instead, they cooperate with one another by forming CRMs for combinatorial regulations . It has been shown that CREs of cooperative TFs of a ChIP-ed TF can be found in the neighborhoods of the binding peaks of the ChIP-ed TF [40, 41, 42, 43, 44]. In this sense, the information of CREs in a ChIP dataset is not fully explored by the majority of current studies that were mainly targeted to identify the CREs of a ChIP-ed TF.
With the continuous drop in costs of NGS technologies, TF ChIP-seq is becoming routine in numerous individual labs worldwide, and enormous ChIP-seq datasets are being produced in many important metazoans and plants, in addition to the large amount of ChIP data churned out by large consortiums such as the ENCODE [5, 45] and modENCODE  projects aimed at identifying all the functional sequence elements in the genomes of humans and the model organisms C. elegans  and D. melanogaster [42, 46]. It is highly expected that very soon, at least one ChIP-seq dataset will be available in a certain cell type, tissue or developmental stage for the majority of TFs encoded in the genomes through these efforts. Since certain combinations of TFs are often repeatedly used for regulating one or more groups (regulons) of genes in some cell types, tissues and developmental stages , the increasing number of ChIP-seq datasets contains a wealth of information about the combinatorial patterns of different TFs for transcriptional regulation [42, 43]. Thus, it is now possible to predict the CREs and CRMs genome-wide through integrating the information about co-occurrence of motifs in a large number of ChIP-seq datasets for different TFs from different cell types, tissues, developmental stages and physiological conditions. Although a few methods such as SpaMo , CPModule  and , have been made to identify CREs of cooperator TFs in a ChIP-seq dataset, they do not integrate multiple ChIP-seq datasets, and cannot predict novel motifs in CRMs, as they all depend on a library of known CREs such as TRANSFAC  or JASPAR  to scan for possible cooperative CREs in binding peaks. Consequently, simple and approximate methods were often used to find motifs in big ChIP datasets. For instance, in recent studies using the modENCODE  and ENCODE [50, 51] datasets, only the top 250 and 500 binding peaks with a length of 100 bp and 200 bp, respectively, in each dataset were used to find motifs. Hence, the wealthy information in the valuable ChIP datasets was not fully explored.
In this paper, we have developed a new algorithm DePCRM for genome-wide [de] novo [p]rediction of [CRMs] and CREs by identifying overrepresented patterns of motif combinations in a large number of ChIP datasets in a sequenced eukaryotic organism. When applied to the D. melanogaster genome using a total of 168 ChIP-chip and ChIP-seq datasets for 56 TFs, DePCRM identified 184 CRE motifs and 115,932 CRMs, recovering 77.9% of known CRMs located in the datasets and 89.3% of known CRMs containing at least one predicted CRE. Thus the algorithm has achieved rather high prediction accuracy even using this limited number of datasets.
Basic idea of the algorithm
Overlap of the extended binding peaks of cooperative TFs in the datasets
Identification of motifs in the extended binding peaks
Summary of the predictions of CREs and CRMs in the D. melanogaster genome at the major steps of the algorithm
Prediction of CRMs by iteratively enriching repeatedly used motif combinatorial patterns
Next, to identify larger repeatedly used motif patterns, we computed a co-occurring score S CPC (formula (5)) for each pair of CPCs across the datasets in which both the CPCs have motifs. Interestingly, as shown in Figure 7C, the S CPC scores display a well-separated bimodal distribution, and the low-scoring peak is likely mainly due to random motif patterns, while the high-scoring one is more likely attributable to truly cooperative motifs, thus we considered CPC pairs with an S CPC ≥ γ =0.69 (at the valley between the two peaks) for further analysis. Applying the MCL algorithm to the resulting CPC co-occurring graph (Figure 2D and E, Methods), gave rise to 815 CRM components (CRMCs) containing 937(937/951 = 98.5%) CPCs, 2,807(2,807/2,842 = 98.8%) CPs and 1,316 (1,316/1,376 = 95.6%) motifs (Table 1). The compositions and structures of these 815 CRMCs are shown in Additional file 6: Figure S3, each containing 1 ~ 9 CPCs. Overall, 16,574 (92.6%) of the original 17,890 input motifs were filtered out by the algorithm (Table 1), suggesting that at least the vast majority (92.6%) of the putative motifs found in the datasets are spurious predictions.
As expected, some of the resulting 1,316 motifs found in different datasets are highly similar and often overlap with one another as demonstrated by the examples shown in Figure 5B. They are likely recognized by the same TFs or closely related ones, thus need to be combined into non-redundant and unique ones. To this end, we iteratively clustered the final 1,316 motifs based on their similarities (Methods), resulting in 184 clusters. We consider each cluster as a unique motif and refer to it as a Umotif, each containing 1 or 2 ~ 108 highly similar motifs and 255 ~ 88,702 CREs (Additional file 7: Figure S4, Additional file 8: Table S3). When compared with the known motifs in multiple built-in databases including DMMPMM, iDMMPMM, flyreg and fly factor survey using TOMTOM [62, 63, 64, 65], 111 (60.3%) of the Umotifs are highly similar to known motifs in D. melanogaster at p < 0.001 (Additional file 8: Table S3), strongly suggesting that they are likely to be true motifs. As shown in Additional file 9: Figure S9, a p-value cutoff of 0.001 is sufficient to identify highly similar motifs. More examples of such Umotifs, their constituent motifs and the known motifs hit are shown in Additional file 10: Figures S5A and 5B. The rest 73 Umotifs that does not resemble any known motif might be novel ones. Examples of such Umotifs, their constituent motifs are shown in Additional file 10: Figures S5C and 5D. Furthermore, 106 (29.4%), 203 (56.2%) and 269 (74.5%) of 381 possibly redundant motifs found in the earlier study  were recovered by the Umotifs with a p-value cutoff of 0.001, 0.005 and 0.01, respectively. We replaced the motifs in the CRMCs with the Umotifs that they belong to, and each of the CRMCs is represented by their constituent Umotifs. Some CRMCs contain the same combination of Umotifs, thus we merged them in a unique one, resulting in 746 CRMCs.
Genome-wide predictions of CREs and CRMs in D. melanogaster
To evaluate the sensitivity of our predicted CRMs, we first computed the recovery rate by the predicted CRMs of the 1,330 known CRMs contained in the datasets. We consider a known CRM is recovered if it overlaps with a predicted CRM by at least half of its length. Remarkably, 1,036 (77.9%) of the 1,330 known CRMs were recovered by the 115,932 putative CRMs (Table 1). By contrast, when the same number and length sequences were randomly selected from the genome region covered by the datasets, only 46.9 ± 5.5% (n = 50) (Figure 8C) of the 1,330 known CRMs could be recovered. The recovery rate for the 1,061 known CRMs, in which at least a putative CRE was found, was even higher (947/1,061 = 89.3%). By contrast, when the same number and length sequences were randomly selected from the genome region covered by the dataset, only 57.2 ± 7.6% (n = 50) (Figure 8C) of the known CRMs were recovered. Hence, our algorithm has achieved rather a high recovery rate or sensitivity of CRM predictions, in particular when a putative CRE could be identified in them, even using just the limited number of datasets for only 56 TFs. Importantly, some of known CREs in these recovered CRMs overlap with our predicted CREs. For example, CRM(3R:21859748.. 21862775) containing Umotif 34 recovers a known CRM of gene e(spl); and a putative CRE of Umotif 34 overlaps with the known CRE of TF DA in the CRM, while Umotif 34 is highly similar to the known motif of DA (Additional file 11: Figure S6A). Furthermore, CRM (2 L: 15731775..15732968) containing Umotifs 106 and 114 recovers the known CRM of gene cycE; moreover, Umotifs 106 and 114 are highly similar to the known motifs of HTH and KNI which also have CREs located in the recovered CRM, respectively (Additional file 11: Figure S6B and S6C). In addition, many of our novel predictions also have strong experimental data supports thus are likely to be authentic. For example, our predicted CRMs 3R:8896195..8898063 , 3R: 12636031..12636729 and 2R: 5984055..5984519 share Umotifs 3 and 14, and they recover the known CRMs of genes abd-A, jun-realted antigen (jra) and single-minded (sim). It has been shown that these three genes are involved in nervous system development [66, 67, 68], and thus are likely to be coregulated. Consistent with this, we identified CREs of Umotifs 3 and 14 in the regulatory regions of these genes. Interestingly, Umotifs 3 and 14 are highly similar to the known motifs of hormone receptor 51 (HR51) and ladybird early (LBE), receptively (Additional file 11: Figures S6D and S6E), and it has been reported that HR51 and LB regulate neurogenesis [69, 70]. Thus HR5 and LB might carry out their functions by binding to the putative CREs of Umotifs 3 and 14. Furthermore, we have predicted a CRM 2R: 16831599..16832019 overlaps with the first intron of gene actin57B (Additional file 12: Figure S7) containing Umotif 27 and 23, which are highly similar to the known motifs of TFs myocyte enhancer factor 2 (MEF2) and chorion factor 2 (CF2), respectively (Additional file 11: Figures S6F and S6G). It has been shown that these two TFs cooperatively regulate Actin57B by binding to its promoter region . Thus MET2 and CF2 might also regulate actin57B through binding to the putative CRES of Umotifs 27 and 23 located in its first intron (Additional file 12: Figure S7). Therefore, our predicted CREs and CRMs can help biologists identify potential enhancers for genes of interest.
The predicted CRMs as well as CREs in a CRM as a whole are more conserved than randomly selected sequences
Highly conserved and non-conserved CRMs regulate distinct classes of genes
To further evaluate our predicted CRMs, we examined whether or not the highly conserved predicted CRMs (with an average phastCons score ≥0.98) and highly non-conserved predicted CRMs (with an average phastCons score ≤0.02) (Figure 9A) have distinct regulatory functions. To this end, we assigned each of the predicted CRMs a target gene whose transcription start site has the shortest distance to the predicted CRM. Thus a predicted CRM can only be assigned to a gene, while a gene can have multiple assigned putative regulating CRMs. A total of 763 and 2,319 genes are predicted as targets of the highly conserved and highly non-conserved putative CRMs, of which 601 and 2,053 have gene ontology (GO) annotations, respectively. As shown in Additional file 13: Table S4, 134 (22.3%) the putative target genes of the 601 highly conserved putative CRMs are clustered into 11 functional groups using the DAVID program  with an enrichment score ≥1.5 and p < 0.01 (hypergeometric test with Benjamini correction). Intriguingly, these genes are enriched for developmental functions (8 groups), neurological functions (1 group), motility (1 group) and transcriptional regulations (1 group). On the other hand, 481(23.4%) putative target genes of the 2,053 highly non-conserved putative CRMs are clustered into 10 functional groups with an enrichment score ≥1.5 and p < 0.01, In contrast to the putative target genes of highly conserved putative CRMs, these genes are enriched for plasma membrane functions (6 groups), metabolism (2 groups), and chemical sensory perception (2 groups) (Additional file 14: Table S5). Thus, the highly conserved putative CRMs and highly non-conserved putative CRMs do regulate distinct groups of genes. The results are in excellent agreement with the fact that highly conserved CRMs are mainly involved in embryonic development in both insects [81, 82] and vertebrates , while CRMs for genes with other functions in particular those related to environmental adaptations evolve extremely fast , strongly suggesting that both the highly conserved putative and non-conserved putative CRMs are likely to be functional. The predicted CREs, Umotifs, CRMs, average phastCons scores and putative target genes are stored in a searchable relational database PCRMs (http://bioinfo.uncc.edu/mniu/pcrms/www/) for public use. The query results and relevant knowledge are displayed using the NCBI graphical sequence viewer. Currently, the algorithm was implemented in Perl, and the scripts are available on the PCRMs website.
The recent development of various ChIP-seq, DNase-seq and FAIRE-seq techniques for locating bind regions of specific TFs, chromatin marks, free nucleosome regions, has provided an unprecedented opportunity for deciphering all cis-regulatory sequences in eukaryotic genomes. These techniques and resulting datasets reveal similar or quite different aspects of cis-regulatory sequences, and have their pros and cons. On one hand, a single epigenomic dataset resulted from DNase-seq, FAIRE-seq, or enhancer mark such as H3K27ac ChIP-seq provide information of the locations of all functional CRMs in a cell or tissue type, thus these techniques could be less expensive. However, it is very difficult to predict novel CREs of specific TFs from such an epigenetic dataset, since it lumps the potential CREs for all TFs active in the cell or tissue type, and TF information is usually unavailable. Due to the lack of CRE locations, it is also difficult to predict CRMs at single nucleotide resolution using epigenetic datasets. On the other hand, a TF ChIP-seq dataset is highly enriched for the CREs of the ChIP-ed TF and of its cooperators, thus all these CREs can be potentially identified at single nucleotide resolution in a cell or tissue type. However, as a TF ChIP-seq dataset only contains location information of CREs of the ChIP-ed TF, a certain number of TFs that are potentially active in the cell or tissue type need to be analyzed to identify all the CREs and CRMs. Nevertheless, to fully understand the cis-regulatory genome and also for a wide spectrum of applications, it is necessary to real the exact locations of all CREs and CRMs in the genome. ChIP-seq datasets for various TF in different cell and tissue types can be the key to the goal.
However, precise identification of CREs in the binding peaks from ChIP experiments is still a challenging computational problem . Efforts have been made to narrow down the binding peaks through improving experimental procedures , thereby facilitating the identification of CREs. On the other hand, once the binding peak summits of a TF are identified, information about the CREs of its cooperative TFs around the summits can provide a good opportunity to identify the relevant CRMs. With the accumulation of a large number of ChIP datasets in many important metazoans and plants, it is tantalizing to predict CRMs around CREs of the ChIP-ed TFs by integrating information in a large number of ChIP datasets in an organism. In this study, we have explored this idea and developed a novel algorithm DePCRM for such a purpose. The algorithm is largely based on the fact that similar TF combinatorial patterns are often repeatedly used to regulate multiple similar or different regulons in different cell types, tissues, developmental stages or physiologically conditions. As the number of possible combinations of TFs is extremely large, DePCRM identifies possible real motif combinatorial patterns in a sufficiently large number of ChIP datasets through iteratively filtering out randomly occurring spurious motifs, thereby effectively reducing the searching space in each step (Table 1). Clearly, in order for the algorithm to make reasonable predictions, the ChIP datasets have to be sufficiently large and diverse, so that they are likely to include datasets for cooperative TFs in different cell types, tissues, developmental stages and physiological conditions.
Using the currently available 168 ChIP datasets for 56 TFs in D. melanogaster, the algorithm was able to recover 77.9% of the known CRMs in the datasets and even 89.3% known CRMs in which a putative CRE could be identified (Table 1). Thus our algorithm has achieved rather high prediction sensitivity even only using these limited 168 datasets, in particular when a putative CRE can be located in the CRMs by a motif finding tool. Although we cannot rigorously evaluate the prediction specificity of the algorithm due to the limited knowledge of CRMs in the genome, it should not be too low for the following reasons. First, the chance for such high recovery rate of known CRMs to happen by chance is virtually impossible as indicated by our simulation studies (Figure 8C). Second, our predicted CRMs as well as CREs in a CRM as a whole are more conserved than the corresponding randomly selected sequences (Figure 9A and B). Third, the highly conserved predicted CRMs tend to be located in the close neighborhoods of genes involved in embryonic development (Additional file 13: Table S4), which is consistent with the existing knowledge [81, 82, 83]. Fourth, the highly non-conserved predicted CRMs tend to be located in the close neighborhoods of genes involved in neural transmission, chemical sensation and metabolism, which is also in excellent agreement with the observations that gene regulatory networks for genes involved in responses to environmental factors tend to evolve very rapidly through rewiring by degrading existing CREs (death), or gaining new CREs (birth), a process called CRE turnover [84, 86]. This form of genetic changes plays a more pivotal role in functional evolution of organisms than previously thought [84, 87]. Therefore, both the conserved and non-conserved putative CRMs are highly likely to be functional.
As vast majority of known CRMs are located in NCRs, we did not attempt to predict CRMs that are entirely located in CDRs, thus, we only allow the extended binding peaks to include at most the adjacent exon (Methods). Nevertheless, 5.9% of our predicted CRMs at least partially include the first or last exon of genes. Although putative CREs in CDRs are more likely to be conserved than those in NCRs (Figure 9C and D), they are not more conserved than the randomly selected short k-mers in CDRs (Figure 9D). Therefore putative CREs in CDRs are not necessarily under a higher selection pressure than are the randomly selected short k-mers in CDRs. On the other hand, the other 94.1% of our predicted CRMs are entirely located in NCRs (Figure 4), and consist of 34.9% of all NCRs in the genome. Interestingly, it has been shown that there are more than three times as many functional NCRs as CDRs in the D. melanogaster genome, because these NCRs are under at least the same level of natural selection as CDRs [73, 75, 79]. In other words, more than 75% of NCRs in the genome are likely to be functional. In this regarding, we have predicted less than half of possible CRMs in the genome. Furthermore, our predicted CRMs are based on 746 combinatorial patterns (i.e., CRMCs) of 184 identified Umotifs. Since TFs of the same structural family tend to recognize highly similar motifs [88, 89], our predicted Umotif might correspond to multiple highly similar motifs of different TFs of the same structural family. Hence, we may have actually predicted more than 184 motifs for some of the 1,052 annotated TFs in the genomes, and many of them are likely novel motifs. However, our predicted motifs might be far away from covering all the annotated TFs as our predicted CRMs only cover 34.9% of NCRs, of which at least 75% are likely to have transcriptional regulatory functions [73, 75, 79].
Nonetheless, our results demonstrate that even these limited 168 datasets for just 56 TFs can result in highly meaningful predictions of CRMs and CREs genome-wide. In other words, these datasets contain sufficient information for repeatedly used motif patterns as indicated by the significant overlaps of their binding peaks (Figure 4 and Additional file 5: Figure S2). On the other hand, because these datasets were not generated by random efforts of the community, rather, they are likely biased to well-studied cooperative TFs, and their CRMs are relatively well documented in the literature. Therefore, if the datasets were generated by random efforts and the known CRMs were characterized by uncorrelated efforts, then we might need a much larger number of datasets to achieve the similar prediction accuracy. Moreover, as indicated above, although we have achieved a rather high recovery rate (89.3%) of known CRMs with a putative CRE, more and diverse ChIP datasets are needed to further improve the predictions, in particular to predict all CRMs in the genome. Fortunately, with ChIP-seq techniques becoming routine and the progress of the ENCODE projects, more and more ChIP-seq datasets will be churned out for numerous and even all TFs encoded in the organisms. Thus our algorithm could be very useful for elucidating CRMs encoded any genome once a sufficient number of diverse ChIP-seq datasets become available in the organism.
Clearly, our predicted result is only a static map of CREs and CRMs encoded in the genome, and for many putative CREs in the predicted CRMs, we may not know their cognate TFs and functional states (active, poised or inactive) in specific cell types and tissues. However, once such a global CRMs map is available for an organism, it is relatively straightforward to infer the functional states to CRMs if epigenetic data in a certain cell type, tissue, developmental stage or physiological condition are available, such as ChIP-seq data for histone modification markers (e.g., mono-, bi- and tri-methylation at lysine 4 of histone 3 or H3K4m1, H3K4m2, K3K4m3, etc.) at active promoters, enhancers and silencers [7, 21, 90, 91, 92, 93], and DNAse-seq data for nucleosome free regions [22, 24, 85, 94, 95]. In this sense, various epigenetic datasets in different cell and tissue types can complement with TF ChIP datasets and speed up the process of deciphering the entire cis-regulatory genome of an organism. Thus, a future development is to incorporate the epigenetic datasets, hereby predicting the functional states of all the predicted CRMs in a certain cell type, tissue, developmental stage or physiological condition [90, 91, 92, 93]. Then, it is also possible to predict the molecular, cellular and organismal phenotypes based on the functional states of the CRMs and their variations among individuals and species, given the recent indication of the importance of CRMs in determining the phenotypes of organism [96, 97, 98, 99, 100, 101, 102, 103, 104].
The exponentially increasing number of TF binding location data produced by the recent wide adaptation of chromatin immunoprecipitation coupled with microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) technologies has provided an unprecedented opportunity to identify CRMs and CREs in genomes. However, how to effectively mine the large volumes of ChIP data to identify CREs and CRMs is a challenging task. We have developed a novel graph-theoretic based algorithm DePCRM for genome-wide de novo predictions of CRMs and CREs using a large number of ChIP datasets. DePCRM predicts CRMs by identifying overrepresented combinatorial motif patterns in multiple ChIP datasets in an effective way. When applied to 168 ChIP datasets of 56 TFs from D. melanogaster, DePCRM identified 184 and 746 overrepresented motifs and their combinatorial patterns, respectively, and predicted a total of 115,932 CRMs in the genome. The predictions recover 77.9% of known CRMs in the datasets, 89.3% of known CRMs containing at least one predicted CRE. These putative CRMs and CREs as a whole in a CRM are more conserved than randomly selected sequences, thus, they are highly likely to be functional. Thus the algorithm can be used to predict CRMs and CREs in other eukaryotic genomes from which a sufficient number of diverse ChIP datasets are available. All the predicted CREs, motifs, CRMs, and their target genes are available at http://bioinfo.uncc.edu/mniu/pcrms/www/.
We attempted to collect all possible ChIP-seq and ChIP-chip datasets from D. melanogaster available to us from three sources: the modENCODE project , the Berkeley drosophila transcription network project (BDTNP)  and literature. We used the binding peak summits in each dataset, provided in the original publications, as the data owners might have a better understanding of their datasets for background subtraction and normalization. We removed binding peaks that overlap with high occupancy target (HOT) regions [42, 43]. Because the typical lengths of known CRMs are 1,000-2,000 bp , we extended the binding peaks shorter than 3,000 bp to up to 3,000 bp by padding equal length of flanking genomic sequences to the two ends. If the extension on either end reaches to an adjacent exon, we only included up to the full length sequence of the exon as majority of CRMs are located in NCRs. We discarded the binding peaks longer than 5,000 bp as they generally have low quality score and consist of only a small portion in the datasets (Figure 3A). The remaining extended binding peaks in each dataset were used for motif finding. The known CREs and CRMs in D. melanogaster were downloaded from the REDfly database .
Measurement of the overlap of binding peaks in two datasets
where |d i | and |d j | are the number of binding peaks in d i and d j , respectively, and o i (d i , d j ) the number of sequences that have at least one pb overlap between the sequences in the two datasets.
Finding motifs in binding peak datasets
Based on an initial evaluation of multiple motif-finding tools for large ChIP datasets, including seeder , Trawler [30, 31], ChIPMunk , HMS , CMF , STEME , DREME , DECOD , RSAT , and POSMO , we selected DREME to identify all possible motifs in each of the extended binding peak dataset for its computational efficiency and capability to return enough number of over-represented motifs in a dataset . As DREME requires a negative dataset for more accurate predictions, we generated a random sequence set for each input dataset using a third order Markov chain model based on the transition probabilities of the sequences in the dataset. In addition, since it is highly unlikely that one can find a large number of high quality motifs in such a random dataset or in a low quality ChIP dataset, we also used DREME as a quality control measure to filter out low quality datasets in which no or only a single motif could be identified.
Our DePCRM algorithm predicts CRMs through the following steps using the putative motifs as the input found in the modified binding peaks from all ChIP-seq and/or ChIP-chip datasets.
Step 1 identify co-occurring motif pairs (CPs) in each dataset
where |M d (i)| and |M d (j)| are the number of binding peaks containing CREs of motifs M d (i) and M d (j), respectively; and o(M d (i), M d (j)) the number of binding peaks containing CREs of both the motifs. We select motif pairs with a S c ≥ α as co-occurring motif pairs (CPs) for further analysis (Figure 2B and C). The cutoff α is chosen such that the predicted motifs in known CRMs are minimally excluded (Figure 5A and B). If there are not enough known CRMs in the genome, a default α = 0.7 is used based on the data from REDfly (see Results).
Step 2 compute similarity scores among all pairs of CPs in different datasets
where Sim(M, N) is the similarity score between motifs M and N using a metric called SPIC that we proposed previously considering both the frequency matrixes and position specific weight matrixes (PSWMs) of both the motifs [105, 106, 107]. We have shown that SPIC outperforms the existing metrics for measuring motif similarities [105, 106, 107]. Note that to compute S s we first select the highest similarity among all the four possible motif pairs, and then sum it with the similarity of the remaining pair.
Step 3 construct the CP similarity graph
where |CP| and |E| are the numbers of CPs and edges in the graph, respectively. We choose an β value such that the resulting graph is as spars as possible and has as many nodes/CPs as possible (Figure 7A and B).
Step 4 cut the CP similarity graph into dense sub-graphs, CP clusters (CPCs)
We use the Markov Chain Clustering algorithm (MCL)  to cut the graph into dense sub-graphs, each corresponding to a cluster of repetitively occurring CPs across multiple datasets (Figure 2D). MCL iteratively computes random walks determined by a Markov chain by alternately executing two operations (expansion and inflation) on a stochastic matrix . It ranks the identified dense sub-graphs according to their sizes in a descending order. It has been shown that MCL works very well in finding dense sub-graphs in very large weighted sparse graphs [61, 105, 106, 108, 109, 110, 111, 112]. We discard the clusters containing fewer than τ CPs (τ = 2 in this study, τthus we only discarded singleton CPs) (Figure 2D). Presumably, the remaining clusters contain highly similar CPs for certain two TFs. For example, cluster C1 (P1, P5, P8) in Figure 2D contains highly similar motifs (red and black ova) for two distinct TFs. For this reason we call these clusters CP clusters (CPCs) (Figure 2D).
Step 5 compute a co-occurring score for each pair of CPCs
where D is the number of datasets in which CPs of both C i and C j occur, P s and P t two CPs from C i and C j , respectively, o(P s ,P t ) the number of binding peaks where P s and P t co-occur, |P| the size of P, and N(Ω dk (C i ,C j )) the number of unique comparisons among the CPs in Ω dk (C i , C j ).
Step 6 construct the CPC co-occurring graph
We construct a CPC co-occurring graph using each CPC as a node, and connecting two CPCs C i and C j by an edge with S CPC (C i , C j ) being the weight if and only if S CPC (C t , C j ) ≥ = y (Figure 2E). The cutoff γ is chosen based on the bimodal distribution of the S CPC sores (Figure 7C).
Step 7 cut the CPC co-occurring graph into dense subgraphs
We apply MCL to cut the CPC co-occurring graph into dense sub-graphs (Figure 2F). Each of these sub-graphs is assumed to correspond to a possible combination of their motifs to form a CRM based on the datasets used. For this reason, we refer to these CPC clusters as CRM components (CRMCs) (Figure 2E).
Step 8 combine highly similar motifs in unique ones
Some motifs in the CRMCs may have overlapping CREs, and can be very similar to one another. It is highly likely that they consist of the same or similar CREs of the same TF or closely related ones. Thus we need to combine such highly similar and possibly redundant motifs into unique ones. To this end, we calculate the pairwise motif similarity of all the motifs in the CRMCs using the SPIC motif similarity metric [105, 106, 107]. We construct a motif similarity graph using the motifs as nodes, and connecting two nodes by an edge with the similarity being the weight if and only if the similarity of the corresponding motifs is greater than 0.7. We identify high density subgraphs in the graph using MCL. For each subgraph, we extend each CRE of each associated motif by padding 5 bp original genomic sequence at each of its two ends. We then identify the common motif in each set of the extended CREs using DREME. For the resulting motifs with more than 50% CRE overlapping and a similarity score more than 0.4, we repeat the above procedure until no two motifs meet the criteria. Each resulting motif has a similarity smaller than 0.4 and an overlapping rate lower than 0.5 with any other motifs. Thus we call each of them a unique motif or Umotif. Each motif in the identified CRMCs is then represented by its Umotif.
Step 9 predict CRMs in the genome
We project CREs of all the CRMCs back to their locations in the genome. If the projected CREs overlap with one another, we merged them in a non-overlapping one. We then connect any two adjacent CREs if their distance is shorter than a preset value δ (δ = 150 bp in this study) according to the distribution of the distances between the CREs in known CRMs (Figure 8B) and the connection cannot span over an exon unless it contain a binding site. We predict as a CRM each segment of sequence connected by CREs of Umotifs in one or multiple CRMCs.
Comparison of our algorithm with a naïve algorithm
Since CRMs are likely to be enriched in our extended peaks, a naïve method that randomly selects sequences from the extended peaks can recover true CRMs. To compare our algorithm with such a naïve method, we concatenated all the genome sequences that are covered by the extended binding peaks according to the order of the sequences on the chromosomes X, Y, 2, 3 and 4, and we connected the two ends of the concatenated sequence to form a circular DNA. For each of CRM predicted by our algorithm, we randomly selected a segment of sequence with the same length as the predicted CRM from the circular DNA. We repeated the process 50 times, and compared their averaged results to our predictions.
This work was supported by the National Science Foundation (EF0849615 and CCF1048261), and the UNC Charlotte (Faculty Research Grants). Funding for open access charge is from CCF1048261. We would like to thank members of the Su lab for discussions. We are thankful to Drs. Steve Gallo and Marc Halfon for providing the detailed CRM data in REDfly, James Johnson for providing a Perl script for matching DREME results to sequences, and Dr. Jason Lieb for his suggestions.
- 1.Consortium CeS: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998, 282 (5396): 2012-2018.Google Scholar
- 5.Consortium TEP: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004, 306 (5696): 636-640.Google Scholar
- 7.Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, Farnham PJ, Hirst M, Lander ES, Mikkelsen TS, Thomson JA: The NIH roadmap epigenomics mapping consortium. Nat Biotechnol. 2010, 28 (10): 1045-1048. 10.1038/nbt1010-1045.PubMedCentralPubMedGoogle Scholar
- 9.Temple G, Gerhard DS, Rasooly R, Feingold EA, Good PJ, Robinson C, Mandich A, Derge JG, Lewis J, Shoaf D, Collins FS, Jang W, Wagner L, Shenmen CM, Misquitta L, Schaefer CF, Buetow KH, Bonner TI, Yankie L, Ward M, Phan L, Astashyn A, Brown G, Farrell C, Hart J, Landrum M, Maidak BL, Murphy M, Murphy T, Rajput B, et al: The completion of the Mammalian Gene Collection (MGC). Genome Res. 2009, 19 (12): 2324-2333.PubMedCentralPubMedGoogle Scholar
- 13.Davidson EH: The Regulatory Genome: Gene Regulatory Networks In Development and Evolution. 2006, Waltham, Massachusetts: Academic PressGoogle Scholar
- 14.Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23 (1): 137-144. 10.1038/nbt1053.PubMedGoogle Scholar
- 19.Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, Snyder M, Jones S: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4 (8): 651-657. 10.1038/nmeth1068.PubMedGoogle Scholar
- 20.Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133 (6): 1106-1117. 10.1016/j.cell.2008.04.043.PubMedGoogle Scholar
- 23.Song L, Zhang Z, Grasfeder LL, Boyle AP, Giresi PG, Lee BK, Sheffield NC, Graf S, Huss M, Keefe D, Liu Z, London D, McDaniell RM, Shibata Y, Showers KA, Simon JM, Vales T, Wang T, Winter D, Clarke ND, Birney E, Iyer VR, Crawford GE, Lieb JD, Furey TS: Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 2011, 21 (10): 1757-1767. 10.1101/gr.121541.111.PubMedCentralPubMedGoogle Scholar
- 24.Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D, Zhou D, Luo S, Vasicek TJ, Daly MJ, Wolfsberg TG, Collins FS: Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006, 16 (1): 123-131.PubMedCentralPubMedGoogle Scholar
- 25.Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009, 326 (5950): 289-293. 10.1126/science.1181369.PubMedCentralPubMedGoogle Scholar
- 42.Negre N, Brown CD, Ma L, Bristow CA, Miller SW, Wagner U, Kheradpour P, Eaton ML, Loriaux P, Sealfon R, Li Z, Ishii H, Spokony RF, Chen J, Hwang L, Cheng C, Auburn RP, Davis MB, Domanus M, Shah PK, Morrison CA, Zieba J, Suchy S, Senderowicz L, Victorsen A, Bild NA, Grundstad AJ, Hanley D, MacAlpine DM, Mannervik M, et al: A cis-regulatory map of the Drosophila genome. Nature. 2011, 471 (7339): 527-531. 10.1038/nature09990.PubMedCentralPubMedGoogle Scholar
- 43.Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, Alves P, Chateigner A, Perry M, Morris M, Auerbach RK, Feng X, Leng J, Vielle A, Niu W, Rhrissorrakrai K, Agarwal A, Alexander RP, Barber G, Brdlik CM, Brennan J, Brouillet JJ, Carr A, Cheung MS, Clawson H, Contrino S, et al: Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science. 2010, 330 (6012): 1775-1787. 10.1126/science.1196914.PubMedCentralPubMedGoogle Scholar
- 45.ENCODE: A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011, 9 (4): e1001046-10.1371/journal.pbio.1001046.Google Scholar
- 46.Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, Washietl S, Arshinoff BI, Ay F, Meyer PE, Robine N, Washington NL, Di Stefano L, Berezikov E, Brown CD, Candeias R, Carlson JW, Carr A, Jungreis I, Marbach D, Sealfon R, Tolstorukov MY, Will S, Alekseyenko AA, Artieri C, Booth BW, et al: Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010, 330 (6012): 1787-1797.PubMedCentralPubMedGoogle Scholar
- 50.Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z: Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012, 22 (9): 1798-1812. 10.1101/gr.139105.112.PubMedCentralPubMedGoogle Scholar
- 51.Wang J, Zhuang J, Iyer S, Lin XY, Greven MC, Kim BH, Moore J, Pierce BG, Dong X, Virgil D, Birney E, Hung JH, Weng Z: Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res. 2013, 41 (Database issue): D171-D176.PubMedCentralPubMedGoogle Scholar
- 52.Davidson EH, Rast JP, Oliveri P, Ransick A, Calestani C, Yuh CH, Minokawa T, Amore G, Hinman V, Arenas-Mena C, Otim O, Brown CT, Livi CB, Lee PY, Revilla R, Rust AG, Pan Z, Schilstra MJ, Clarke PJ, Arnone MI, Rowen L, Cameron RA, McClay DR, Hood L, Bolouri H: A genomic regulatory network for development. Science. 2002, 295 (5560): 1669-1678. 10.1126/science.1069883.PubMedGoogle Scholar
- 53.Li XY, MacArthur S, Bourgon R, Nix D, Pollard DA, Iyer VN, Hechmer A, Simirenko L, Stapleton M, Luengo Hendriks CL, Chu HC, Ogawa N, Inwood W, Sementchenko V, Beaton A, Weiszmann R, Celniker SE, Knowles DW, Gingeras T, Speed TP, Eisen MB, Biggin MD: Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol. 2008, 6 (2): e27-10.1371/journal.pbio.0060027.PubMedCentralPubMedGoogle Scholar
- 56.Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, Min R, Alves P, Abyzov A, Addleman N, Bhardwaj N, Boyle AP, Cayting P, Charos A, Chen DZ, Cheng Y, Clarke D, Eastman C, Euskirchen G, Frietze S, Fu Y, Gertz J, Grubert F, Harmanci A, Jain P, Kasowski M, et al: Architecture of the human regulatory network derived from ENCODE data. Nature. 2012, 489 (7414): 91-100. 10.1038/nature11245.PubMedCentralPubMedGoogle Scholar
- 61.van Dongen S: A cluster Algorithm for Graphs. 2000, Amsterdam: National Research Institute for Mathematics and Computer Science in the NetherlandsGoogle Scholar
- 64.Zhu LJ, Christensen RG, Kazemian M, Hull CJ, Enuameh MS, Basciotta MD, Brasefield JA, Zhu C, Asriyan Y, Lapointe DS, Sinha S, Wolfe SA, Brodsky MH: FlyFactorSurvey: a database of Drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Res. 2011, 39 (Database issue): D111-D117.PubMedCentralPubMedGoogle Scholar
- 65.Kulakovskiy IV, Makeev VJ: Discovery of DNA motifs recognized by transcription factors through integration of different experimental sources. Biophysics. 2009, 54 (6): 667-674. 10.1134/S0006350909060013.Google Scholar
- 72.Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005, 15 (8): 1034-1050. 10.1101/gr.3715005.PubMedCentralPubMedGoogle Scholar
- 80.Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4 (5): 3-10.1186/gb-2003-4-5-p3.Google Scholar
- 83.Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G: Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005, 3 (1): e7-10.1371/journal.pbio.0030007.PubMedCentralPubMedGoogle Scholar
- 87.Wittkopp PJ, Kalay G: Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat Rev Genet. 2012, 13 (1): 59-69. 10.1038/nri3362.Google Scholar
- 89.Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, Palin K, Vaquerizas JM, Vincentelli R, Luscombe NM, Hughes TR, Lemaire P, Ukkonen E, Kivioja T, Taipale J: DNA-binding specificities of human transcription factors. Cell. 2013, 152 (1–2): 327-339.PubMedGoogle Scholar
- 90.Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, Bernstein BE: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011, 473 (7345): 43-49. 10.1038/nature09906.PubMedCentralPubMedGoogle Scholar
- 91.Ram O, Goren A, Amit I, Shoresh N, Yosef N, Ernst J, Kellis M, Gymrek M, Issner R, Coyne M, Durham T, Zhang X, Donaghey J, Epstein CB, Regev A, Bernstein BE: Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells. Cell. 2011, 147 (7): 1628-1639. 10.1016/j.cell.2011.09.057.PubMedCentralPubMedGoogle Scholar
- 93.Zhu J, Adli M, Zou JY, Verstappen G, Coyne M, Zhang X, Durham T, Miri M, Deshpande V, De Jager PL, Bennett DA, Houmard JA, Muoio DM, Onder TT, Camahort R, Cowan CA, Meissner A, Epstein CB, Shoresh N, Bernstein BE: Genome-wide chromatin state transitions associated with developmental and environmental cues. Cell. 2013, 152 (3): 642-654. 10.1016/j.cell.2012.12.033.PubMedCentralPubMedGoogle Scholar
- 98.Babak T, Garrett-Engele P, Armour CD, Raymond CK, Keller MP, Chen R, Rohl CA, Johnson JM, Attie AD, Fraser HB, Schadt EE: Genetic validation of whole-transcriptome sequencing for mapping expression affected by cis-regulatory variation. BMC Genomics. 2010, 11: 473-10.1186/1471-2164-11-473.PubMedCentralPubMedGoogle Scholar
- 99.Stranger BE, Montgomery SB, Dimas AS, Parts L, Stegle O, Ingle CE, Sekowska M, Smith GD, Evans D, Gutierrez-Arcelus M, Price A, Raj T, Nisbett J, Nica AC, Beazley C, Durbin R, Deloukas P, Dermitzakis ET: Patterns of cis regulatory variation in diverse human populations. PLoS Genet. 2012, 8 (4): e1002639-10.1371/journal.pgen.1002639.PubMedCentralPubMedGoogle Scholar
- 102.Kasowski M, Kyriazopoulou-Panagiotopoulou S, Grubert F, Zaugg JB, Kundaje A, Liu Y, Boyle AP, Zhang QC, Zakharia F, Spacek DV, Li J, Xie D, Olarerin-George A, Steinmetz LM, Hogenesch JB, Kellis M, Batzoglou S, Snyder M: Extensive variation in chromatin states across humans. Science. 2013, 342 (6159): 750-752. 10.1126/science.1242510.PubMedCentralPubMedGoogle Scholar
- 103.Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M: Variation in transcription factor binding among humans. Science. 2010, 328 (5975): 232-235. 10.1126/science.1183621.PubMedCentralPubMedGoogle Scholar
- 107.Zhang S, Jiang L, Du C, Su Z: A novel information content-based similarity metric for comparing transcription factor binding site motifs. IEEE 6th International Conference on Systems Biology (ISB). 2012:32–36Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.