Reveal cell type-specific regulatory elements and their characterized histone code classes via a hidden Markov model
With the maturation of next-generation sequencing technology, a huge amount of epigenomic data has been generated by several large consortia in the last decade. These plentiful resources offer an opportunity to fully exploit the data to explore biological problems.
Here we developed an integrative and comparative method, CsreHMM, which is based on a hidden Markov model, to systematically reveal cell type-specific regulatory elements (CSREs) along the whole genome and simultaneously recognize the histone codes (mark combinations) characterizing them. This method also reveals the subclasses of CSREs and explicitly labels those shared by a few cell types. We applied this method to a data set of 9 cell types and 9 chromatin marks to demonstrate its effectiveness and found that the revealed CSREs relate significantly to different kinds of functional regulatory regions. Their proximal genes have consistent expression and are likely to participate in cell type-specific biological functions.
These results suggest CsreHMM has the potential to help understand cell identity and the diverse mechanisms of gene regulation.
Keywords: Epigenomics; Cell type-specific regulatory elements; Hidden Markov model; Histone modification
With the rapid development of sequencing technologies, a myriad of epigenomic data has been generated both by large consortia such as ENCODE, modENCODE and the Roadmap Epigenomics Project, and by many independent laboratories. These data cover histone modifications, chromatin openness, DNA methylation, nucleosome positioning and so on. Among them, histone modifications comprise over 100 types, and their combinatorial presence is closely related to distinct patterns of gene regulation. For example, H3K4me1 and H3K27ac have been used successfully to identify enhancers genome-wide. In contrast, the combination of H3K4me1 and H3K27me3 is a well-studied marker of poised enhancers. With so much epigenomic data available, decoding the abundant information hidden behind functional regulatory elements is a challenge for computational biology.
To this end, dozens of computational tools have been developed in the past decade [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. ChromHMM is a typical one, used by large consortia to generate genome-wide chromatin annotations for diverse cell types based on ChIP-seq peaks of chromatin modifications, transcription factors and DNaseI hypersensitive sites. It uses a multivariate hidden Markov model with independent Bernoulli emission distributions to learn the underlying chromatin states. The algorithm converts raw signals in 200-bp non-overlapping bins into binary values based on a Poisson distribution and then concatenates the epigenomes of multiple cell types to learn the segmentation jointly. Other methods have recently extended this algorithm in different directions. For example, EpiCSeg and GenoSTAN adapted the modeling of emission probabilities to fit raw counts or signals and thereby retain more information. TreeHMM, hiHMM and IDEAS applied more sophisticated hidden structures to reveal relationships between cell types or species. Spectacle leveraged spectral learning to explicitly model mark combinations and accelerate the training process. BdHMM and dsHMM took direction into account to better annotate gene structure on both strands of DNA. GBR-Segway integrated Hi-C data with histone combinations to better annotate the genome.
Although these methods facilitate the determination and characterization of various chromatin states for a cell type, they do not directly explore differences between the epigenomes of cell types, which could provide novel information about cell type-specific biological functions and cell identity. To directly identify cell type-specific regulatory elements (CSREs) by comparing epigenomes, Chen et al. proposed a differential Chromatin Modification Analysis (dCMA) strategy and defined CSREs for nine cell lines. Wang and Zhang adapted this method to determine CSREs across 127 cell types and tissues for a comprehensive characterization of the CSREs and their functions. Their analyses found that epigenomic modifications function in cell type-specific manners and that CSREs show significant, cell type-specific biological relevance and tend to be regulatory elements. However, dCMA only locates CSREs for each cell type and does not directly characterize their underlying specific histone codes. Besides, the CSREs shared by multiple cell types reveal important common functions among them; these were found via overlap analysis for a given group of cell types, but could not be identified automatically by dCMA.
To this end, we developed a hidden Markov model to systematically identify CSREs (CsreHMM). Compared to dCMA, this method can additionally learn the subclasses of CSREs and their characteristic histone codes directly, which is necessary to explicitly illustrate the diverse functions of CSREs. Besides, CsreHMM can naturally identify groups of cell types that tend to share common CSREs, revealing the common functions among those cell types. We first applied it to a data set of 9 cell types and 9 chromatin marks to demonstrate its effectiveness. The identified CSREs show distinct preferences for known functional regulatory regions. Their proximal genes have consistent expression and are likely to participate in cell type-specific biological functions. These results suggest that the HMM can not only determine significant, functionally relevant CSREs, but also reveal CSRE-related specific histone codes, which have the potential to help understand gene regulation and cell identity.
We downloaded the ChIP-seq data of 9 chromatin marks as well as a whole-cell extract (WCE) control across 9 cell types. The cell types consist of human embryonic stem cells (H1), chronic myelogenous leukemia (K562), lymphoblastoid (GM12878), hepatocellular carcinoma (HepG2), human umbilical vein endothelial cells (HUVEC), human skeletal muscle cells and myoblasts (HSMM), normal human lung fibroblasts (NHLF), normal human epidermal keratinocytes (NHEK), and human mammary epithelial cells (HMEC). The nine chromatin marks include CTCF, H3K27me3, H3K36me3, H4K20me1, H3K4me1/2/3, H3K27ac, and H3K9ac. The WCE was sequenced as the control for each cell type. From GSE26386, we downloaded the reads that had been aligned to hg18 by MAQ (http://maq.sourceforge.net/maq-man.shtml). For each pair of cell type and mark, replicates were merged and peaks were called. Specifically, the whole genome was divided into 200-bp non-overlapping bins. Each read was extended to 200 bp from its 5′ end in the 3′ direction and was then assigned to a bin by its midpoint. The peaks were called based on a Poisson background model, with λ equal to the average read count across all bins, at a threshold of 10⁻⁴.
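The binning and peak-calling steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, coordinates are assumed to be 0-based, and the threshold of 10⁻⁴ is assumed to apply to the upper-tail Poisson probability of a bin's read count.

```python
import math
from collections import Counter

BIN_SIZE = 200    # bin width in bp, as stated in the text
EXTENSION = 200   # reads are extended to this length from the 5' end


def read_bin(start_5p, strand):
    """Bin index of a read after extension (hypothetical helper).

    start_5p is the 0-based position of the read's 5' end; strand is '+' or '-'.
    """
    if strand == "+":
        first, last = start_5p, start_5p + EXTENSION - 1
    else:
        first, last = start_5p - EXTENSION + 1, start_5p
    midpoint = (first + last) // 2
    return max(midpoint, 0) // BIN_SIZE


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via one minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(counts, n_bins, threshold=1e-4):
    """Return bins whose count is unlikely under the genome-wide background.

    counts maps bin index -> read count; lambda is the mean count per bin.
    """
    lam = sum(counts.values()) / n_bins
    return {b for b, c in counts.items() if poisson_sf(c, lam) < threshold}


# Toy example: one enriched bin among sparse background reads.
counts = Counter({3: 20})
for b in range(100, 180):
    counts[b] = 1
peaks = call_peaks(counts, n_bins=1000)
```

In the toy example the background rate is 0.1 reads per bin, so only the bin holding 20 reads passes the threshold.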
Input for HMM
To capture the combinatorial information of multiple marks in each bin, we stacked S(m) for all M marks to form an MN × T matrix O = (S(1); S(2); …; S(M)) (Fig. 1b). Each row of O stands for a cell-mark combination, indicating whether the cell is specific according to that mark. We then treated the columns of O as observations and trained a multivariate HMM to reveal the hidden states behind them.
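As a small illustration of how the observation matrix is assembled, the sketch below stacks per-mark matrices row-wise and reads off the columns as observation vectors. It assumes each S(m) is an N × T binary matrix (N cell types, T bins) stored as nested lists; the function names are hypothetical.

```python
def stack_marks(spec_by_mark):
    """Stack M matrices of shape N x T row-wise into one (M*N) x T matrix."""
    n_bins = len(spec_by_mark[0][0])
    rows = []
    for matrix in spec_by_mark:
        for row in matrix:
            if len(row) != n_bins:
                raise ValueError("all rows must cover the same T bins")
            rows.append(list(row))
    return rows


def observations(stacked):
    """Return one tuple per bin (column), the unit the HMM is trained on."""
    return [tuple(col) for col in zip(*stacked)]


# Toy example: M = 2 marks, N = 2 cell types, T = 3 bins.
s1 = [[1, 0, 0],
      [0, 0, 1]]
s2 = [[0, 1, 0],
      [0, 0, 1]]
O = stack_marks([s1, s2])
obs = observations(O)
```

Each element of `obs` has length M × N, so a single hidden state can encode which cell-mark combinations tend to be specific together.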
The HMM model
As there are hidden variables, we maximize the likelihood using the incremental expectation-maximization algorithm, a variant of the Baum-Welch algorithm that accelerates training with multiple observations.
There are many ways to initialize the parameters of an HMM in the literature. For example, several studies used random initializations. Several studies used k-means clustering to obtain an initial segmentation and estimate parameters. And several studies used an entropy-based method to segment the genome and estimate parameters. Among them, the entropy-based method gives similar initializations for models with different numbers of states. Hence, models initialized this way are more comparable across different numbers of states. We therefore used the entropy-based method to initialize our HMM.
Determination of specific states
To determine which states are specific to which cell types, we used the emission probabilities (Fig. 1c and d; and Additional file 1: Figure S1). For each state, we sorted the emission probabilities of all cell-mark combinations in decreasing order and found the maximal difference between consecutive values. The probability just above this gap was denoted as ps. To remove small probabilities, we also set a threshold p0 (0.3 was used in this study). Only a cell-mark combination with probability passing both ps and p0 was defined as a specific one. A specific state was then defined as one with at least one specific cell-mark combination. The name of each specific state was based on its corresponding cell types. A region consisting of consecutive bins covered by a specific state was defined as a cell type-specific regulatory element (CSRE).
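The selection rule above can be sketched in a few lines. This is an illustrative reading of the text, assuming that "passing" means greater than or equal to both ps and p0; the function names and the dictionary layout are hypothetical.

```python
def specific_combinations(emissions, p0=0.3):
    """Return the cell-mark combinations that are specific for one state.

    emissions maps (cell, mark) -> emission probability for that state.
    """
    ranked = sorted(emissions.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return []
    probs = [p for _, p in ranked]
    # Differences between consecutive sorted probabilities.
    gaps = [probs[i] - probs[i + 1] for i in range(len(probs) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    p_s = probs[cut]  # probability just above the largest gap
    return [key for key, p in ranked if p >= p_s and p >= p0]


def state_cell_types(specific):
    """Cell types a state is specific to, in order of decreasing probability."""
    cells = []
    for cell, _mark in specific:
        if cell not in cells:
            cells.append(cell)
    return cells


# Toy state with strong H3K27ac emission in NHEK and HMEC only.
emissions = {
    ("NHEK", "H3K27ac"): 0.90,
    ("HMEC", "H3K27ac"): 0.85,
    ("H1", "H3K27ac"): 0.05,
}
shared = state_cell_types(specific_combinations(emissions))
```

A state for which the returned list is empty would be treated as non-specific; a state whose list spans two cell types corresponds to the shared states such as NHEK_HMEC in the text.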
We trained models with the number of states ranging from 5 to 70 in steps of 5. Each model converged within 300 training iterations, meaning that a local maximum of the log-likelihood was reached. We calculated the BIC and AIC scores as BIC = ln(#bins) × #parameters − 2 ln(likelihood) and AIC = 2 × #parameters − 2 ln(likelihood), respectively, where #parameters = (#states − 1) + #states × (#states − 1) + #states × #cells × #marks. Both BIC and AIC scores decrease monotonically as the number of states increases (Additional file 1: Figure S2), so even the 70-state model may not attain the minimum. However, the 70-state model contains many similar cell type-specific states that cannot be distinguished well by their emission probabilities (Additional file 1: Figure S3). Thus, neither BIC nor AIC is a suitable criterion for model selection here. Finally, we selected the 30-state model for downstream analyses. One reason is that the log-likelihood increases smoothly from the 30- to the 70-state model. Another important reason is that the 30-state model is the first to harbor a specific state marked by H3K36me3.
where p_s = (p_{s,1}, p_{s,2}, …, p_{s,R}), and s′ is a state in model H. We trained ten 30-state models with random initializations. All of them converged within 500 iterations. We found that the specific states have significantly higher recovery scores than non-specific ones (Additional file 1: Figure S4A and B), which demonstrates the robustness of our results. We also trained models with different numbers of states as described above. Models with more than 30 states preserve all states of the 30-state model and use the additional states to learn other patterns (Additional file 1: Figure S5).
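The parameter count and the two criteria given above translate directly into code. The sketch below only restates the formulas from the text; the function names are hypothetical and the log-likelihood value must come from a trained model.

```python
import math


def n_parameters(n_states, n_cells, n_marks):
    """Initial + transition + emission parameters, as counted in the text."""
    initial = n_states - 1
    transition = n_states * (n_states - 1)
    emission = n_states * n_cells * n_marks
    return initial + transition + emission


def bic(log_likelihood, n_bins, k):
    """BIC = ln(#bins) * #parameters - 2 * ln(likelihood)."""
    return math.log(n_bins) * k - 2.0 * log_likelihood


def aic(log_likelihood, k):
    """AIC = 2 * #parameters - 2 * ln(likelihood)."""
    return 2.0 * k - 2.0 * log_likelihood


# Parameter count of the selected 30-state model (9 cell types, 9 marks).
k30 = n_parameters(30, 9, 9)
```

For the 30-state model this gives 29 + 870 + 2430 = 3329 free parameters, which is small relative to the number of 200-bp bins in the genome; this helps explain why both criteria keep decreasing as states are added.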
Mapping CSREs to various genomic features
We examined the potential functional relevance of CSREs by mapping them to known genomic features. We leveraged RefSeq annotation to build a TxDb object in Bioconductor on December 12, 2016 and extracted genomic features therein [22, 23]. Each transcript named with a prefix of “NM” by RefSeq was regarded as a gene here. Beyond that, we defined six genomic features: promoter, 5’UTR, 3’UTR, exon, intron and intergenic region. Promoters were defined as regions within 2000 bp of a transcription start site (TSS) and intergenic regions were composed of base pairs in none of the other five features. We assigned each CSRE to one of its overlapping features according to the order: promoter > 5’UTR > 3’UTR > exon > intron > intergenic region.
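The priority rule for assigning a CSRE to a single feature can be sketched as below. The interval representation (half-open start/end pairs on one chromosome) and the function names are assumptions made for illustration; the original analysis used a TxDb object in Bioconductor.

```python
# Priority order stated in the text, highest first.
PRIORITY = ["promoter", "5'UTR", "3'UTR", "exon", "intron", "intergenic"]


def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def assign_feature(overlapping):
    """Pick the highest-priority feature among those a CSRE overlaps."""
    for feature in PRIORITY:
        if feature in overlapping:
            return feature
    return "intergenic"  # no annotated feature overlaps the CSRE


def annotate(csre, features):
    """features is a list of (start, end, name) tuples on the same chromosome."""
    hits = {name for start, end, name in features if overlaps(csre, (start, end))}
    return assign_feature(hits)


# Toy annotation: a promoter followed by an intron containing one exon.
features = [(0, 2000, "promoter"), (2000, 5000, "intron"), (2500, 2700, "exon")]
label = annotate((2600, 2800), features)
```

Here the CSRE overlaps both the exon and the surrounding intron and is labeled as exon, because exon ranks above intron in the priority list.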
CSRE proximal genes were defined with a stringent criterion. Only genes with a consecutive 3 kb region within their promoters and bodies covered by CSREs from a specific state are defined as CSRE proximal genes for that state.
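One way to check the 3 kb criterion is to merge the CSREs of a state within the gene region and measure the longest contiguous covered stretch. The sketch below assumes half-open intervals and that the gene region already spans promoter plus gene body; the function names are hypothetical.

```python
def longest_covered_run(region, csres):
    """Length of the longest contiguous stretch of `region` covered by CSREs."""
    start, end = region
    # Clip CSREs to the region and sort them by start position.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in csres if s < end and e > start
    )
    best = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is not None and s <= cur_end:
            cur_end = max(cur_end, e)   # extends the current run
        else:
            cur_start, cur_end = s, e   # starts a new run
        best = max(best, cur_end - cur_start)
    return best


def is_proximal(region, csres, min_run=3000):
    """True if CSREs of one state cover at least `min_run` consecutive bp."""
    return longest_covered_run(region, csres) >= min_run


gene_region = (0, 10000)
adjacent = [(1000, 2600), (2600, 4200)]   # two abutting CSREs, 3200 bp in total
split = [(1000, 2600), (2800, 4200)]      # same CSREs separated by a 200-bp gap
```

With these inputs, `is_proximal(gene_region, adjacent)` holds while `is_proximal(gene_region, split)` does not, which shows how strict the requirement of consecutive coverage is.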
Gene expression and specificity
Microarray data were downloaded for all 9 cell types from GSE26386. First, we used RMA to process the raw CEL files. The replicate expression values from the same cell types were then averaged. Next, the expression values of probe sets were averaged according to their corresponding RefSeqs. Finally, the average values were quantile normalized across 9 cell types and used as the expressions.
For each gene, we computed its z-scores of expressions across cell types and defined them as gene specificity scores. High positive (or low negative) specificity score indicates specific high (or low) expression for a gene. Difference of gene specificity scores for groups was tested by two-sample Wilcoxon test.
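The specificity score of a gene is simply its expression standardized across cell types. A minimal sketch follows, assuming the sample standard deviation is used (the text does not state which estimator was applied); the function name is hypothetical.

```python
import statistics


def specificity_scores(expression):
    """Z-scores of one gene's expression values across cell types."""
    mean = statistics.fmean(expression)
    sd = statistics.stdev(expression)  # sample standard deviation (assumption)
    if sd == 0:
        return [0.0] * len(expression)  # constant expression: no specificity
    return [(x - mean) / sd for x in expression]


# Toy gene measured in three cell types.
scores = specificity_scores([1.0, 2.0, 3.0])
```

A gene with a large positive score in one cell type and scores near or below zero elsewhere would be regarded as specifically highly expressed in that cell type.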
GO enrichment analysis
We explored the biological functions of CSRE proximal genes by GO enrichment analysis. Each gene set of interest was mapped to GO terms using the org.Hs.eg.db and GO.db Bioconductor packages. Fisher’s exact test was used to obtain the P-values, which were then corrected by the Benjamini-Hochberg method for each cell type. Only GO terms with 5 to 500 genes were kept.
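The statistical core of this step can be sketched without any external package. The code below assumes a one-sided (enrichment) Fisher's exact test, which is equivalent to the upper tail of the hypergeometric distribution, followed by the Benjamini-Hochberg adjustment; the function names are hypothetical and the original analysis was carried out in R.

```python
from math import comb


def enrichment_pvalue(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: background genes, K: genes annotated to the GO term,
    n: genes in the query set, k: query genes annotated to the term.
    """
    top = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, top + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def keep_term(term_size, lo=5, hi=500):
    """Size filter from the text: keep GO terms with 5 to 500 genes."""
    return lo <= term_size <= hi


adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For the four toy P-values the adjusted values are 0.02, 0.04, 0.04 and 0.02, up to floating-point rounding.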
Cell type-specific DNase and EP300 peaks
We obtained the DNase and EP300 peaks from ENCODE via AnnotationHub and then converted them from hg19 to hg18 with the liftOver function of rtracklayer. DNase and EP300 peaks were available for 9 and 4 cell types, respectively. Cell type-specific DNase or EP300 peaks of a cell type were defined as the parts of the original peaks that were not covered by peaks from any other cell type. To examine the relationship between CSREs from each specific state and specific DNase or EP300 peaks in the corresponding cell types, we counted the overlaps between them. We randomly sampled 1000 sets of false CSREs for each specific state, with length and chromosome preserved, and calculated the number of overlaps as genome-wide background observations. Then, a one-sample Wilcoxon test was used to evaluate the statistical significance of the observed number of overlaps.
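The definition of cell type-specific peaks amounts to interval subtraction. The sketch below works on half-open intervals from a single chromosome and uses hypothetical function names; it does not reproduce the random sampling or the Wilcoxon test.

```python
def subtract(peaks, others):
    """Return the parts of `peaks` not covered by any interval in `others`."""
    others = sorted(others)
    result = []
    for start, end in sorted(peaks):
        cursor = start  # left edge of the part not yet known to be covered
        for o_start, o_end in others:
            if o_end <= cursor:
                continue            # lies entirely to the left
            if o_start >= end:
                break               # this and all later intervals lie to the right
            if o_start > cursor:
                result.append((cursor, o_start))
            cursor = max(cursor, o_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


def specific_peaks(peaks_by_cell, cell):
    """Peak fragments of `cell` not covered by peaks of any other cell type."""
    others = [p for c, ps in peaks_by_cell.items() if c != cell for p in ps]
    return subtract(peaks_by_cell[cell], others)


def count_overlapping(csres, peaks):
    """Number of CSREs overlapping at least one peak."""
    return sum(any(s < pe and ps < e for ps, pe in peaks) for s, e in csres)


peaks_by_cell = {"A": [(0, 100), (200, 300)], "B": [(50, 80)], "C": [(90, 250)]}
a_specific = specific_peaks(peaks_by_cell, "A")
```

The same `count_overlapping` call could be applied to each randomly sampled set of false CSREs to build the background distribution described above.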
Applying CsreHMM to the roadmap Epigenomics dataset
We downloaded the signals of epigenomic modification tracks [−log10(P-value)] for five histone marks of 127 tissues and cell types (Additional file 2: Table S1) generated by the Roadmap Epigenomics Consortium at http://egg2.wustl.edu/roadmap/data/. The −log10(P-value) was generated by MACS2. We averaged the signal over each 200-bp non-overlapping bin and binarized it at a threshold of 2, as recommended by the Roadmap Epigenomics Consortium. The histone marks consist of H3K4me1, H3K4me3, H3K36me3, H3K27me3 and H3K9me3, which relate to regulatory elements, promoters, transcribed chromatin, Polycomb-repressed regions and heterochromatin, respectively.
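The averaging and binarization step can be sketched as follows. The sketch assumes the signal is available as one value per base pair in a plain list and that a bin is called 1 when its mean reaches the threshold; both the input layout and the function name are illustrative assumptions.

```python
def binarize(signal, bin_size=200, threshold=2.0):
    """Average a per-base signal in non-overlapping bins and binarize it."""
    bins = []
    for start in range(0, len(signal), bin_size):
        window = signal[start:start + bin_size]  # last window may be shorter
        mean = sum(window) / len(window)
        bins.append(1 if mean >= threshold else 0)
    return bins


# Toy track: a strong bin, a weak bin and a partial bin at the chromosome end.
track = [3.0] * 200 + [0.5] * 200 + [4.0] * 100
calls = binarize(track)
```

Because the signal is −log10(P-value), a threshold of 2 corresponds to an average P-value of roughly 0.01 within the bin.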
We trained a 30-state model with s = 5 for the 127 cell types or tissues and a 20-state model with s = 2 for 9 cell types of group “HSC & B cell”. The emission probabilities were analyzed and GO enrichment analysis was conducted for proximal genes of each state as aforementioned.
Diversity of specific states
The 20 specific states have, on average, ~35,501 CSREs (ranging from 9554 in HepG2_3 to 77,601 in NHEK_HMEC_1; Additional file 1: Figure S6A), each spanning on average ~1% of the genome. The median lengths of CSREs across the 20 states were similar (around 600 bp), except for two states with longer CSREs (1200 bp for H1_3 and 2200 bp for HepG2_3) (Additional file 1: Figure S6B). The genome covered by a specific state varies from ~10.5 Mb (HSMM_NHLF) to 51.9 Mb (NHEK_HMEC_1) (Additional file 1: Figure S6C). The number of CSRE proximal genes also varies, from 284 (HSMM_1) to 3459 (HepG2_3) (Additional file 1: Figure S6D). The diversity of these statistics may indicate the functional complexity of the specific states.
Specific states relate to various genomic features
We next explored the relationship between CSREs from different specific states and six genomic features. The proportion and fold change of CSREs in genomic features vary across specific states (Fig. 2b, and Additional file 1: Figure S7 and S8). Specific states marked by H3K4me1 have a larger proportion of CSREs in intergenic regions and a smaller proportion in promoters than states marked by H3K4me3 in the corresponding cell types, e.g. K562_1 vs K562_2, which is consistent with H3K4me1 being located mainly in enhancers and H3K4me3 being centered around TSSs. H1_3, the only state marked by H3K27me3, which is related to Polycomb-repressed regions, has the highest proportion of CSREs in promoters, implying that their proximal genes are held in a poised state. This observation is consistent with the characteristics of embryonic stem cells. CSREs of HepG2_3 are substantially enriched in 5’UTRs, 3’UTRs, exons and introns compared to those of the other specific states, which is expected because HepG2_3 has specifically high H3K36me3 signals.
Even though specific states are not enriched in intergenic regions (Additional file 1: Figure S7), intergenic CSREs still constitute ~43.6% of all CSREs on average, indicating potential regulatory roles of non-coding regions. For the intergenic CSREs of each specific state, we calculated the distances to their nearest TSSs and found that they are significantly shorter than those of randomly simulated CSREs (Fig. 2c), suggesting that they tend to lie close to their nearest genes even though they do not overlap them. This implies that intergenic CSREs may regulate their nearby genes.
CSREs demonstrate distinct functional specificity
Relationship between CSREs and DNase peaks or EP300 binding sites
DNase-seq peaks mark open chromatin where transcription factors can readily bind DNA. DNase peaks have been comprehensively exploited to identify regulatory elements in diverse cell types. Differential DNase-seq footprinting identifies cell type-determining transcription factors. Thus, we expected CSREs to be proximal to cell type-specific DNase peaks. Indeed, CSREs from all specific states overlap significantly more peaks than the randomly simulated ones do (Additional file 1: Figure S10), which suggests that CSREs, as well as their underlying modifications, could play a crucial role in cell type-specific regulatory activities.
CSREs reveal cell type-specific behavior of genes: Two case studies
For the CSREs shared by two cell types, we expected their proximal genes to function specifically in both cell types as well. We took a CSRE in NHEK_HMEC_2 as an example. The third longest CSRE of NHEK_HMEC_2 is a 9600-bp region encompassing the whole gene body of KRT14 (Additional file 1: Figure S16). This gene provides instructions for making keratin 14, a fibrous protein that is a component of skin. It is also known as an epithelial marker. As expected, we observed strikingly high expression of KRT14 in both NHEK and HMEC (Additional file 1: Figure S17B). Intriguingly, in the two cell types, more than 3/4 of the CSRE harbors the active marks H3K27ac, H3K9ac and H3K4me1/2, which are nearly absent from this region in the other cell types (Additional file 1: Figure S17A), indicating that KRT14 may be regulated by this precise histone modification pattern in both cell types.
Application of CsreHMM to a large-scale dataset reveals hierarchical specific CSREs
We also applied CsreHMM to a large-scale dataset of 127 cell types and tissues from the Roadmap Epigenomics Project (Additional file 2: Table S1). Some of these cell types or tissues come from the same lineage and are very similar to each other. As the differences between cell types from different lineages are larger than those between cell types within the same lineage, directly applying CsreHMM to this dataset is more likely to capture differences between lineages and yield lineage-specific chromatin-modified regions.
Even though it is hard to focus on differences within a lineage by directly applying CsreHMM to the whole dataset, this can still be achieved by applying the model to the epigenomes within a specific lineage. For example, we trained a 20-state model for 9 cell types in the group “HSC & B cell” to examine the subtle differences among them (Fig. 7b). We found that states 1, 14 and 18 have neutrophil-specific H3K36me3, H3K4me1 and H3K4me3 signals, respectively, indicating that they are active regulators in neutrophils. Notably, for all three states, the proximal genes are significantly enriched in the GO term “neutrophil activation” (Additional file 2: Table S3). State 19 has a natural killer cell-specific H3K4me1 signal. Interestingly, its proximal genes are significantly enriched in “T cell activation” (Additional file 2: Table S3), which seems surprising but is consistent with the recent observation that NK cells contribute to the activation of T cells.
In summary, this application demonstrates the ability of CsreHMM to find lineage- or cell type-specific regulatory elements from large-scale epigenomic data.
Here we introduced a comparative computational method, CsreHMM, to systematically identify cell type-specific regulatory elements along the whole genome together with their characteristic histone codes. We applied our method to an ENCODE dataset and found that two thirds of the states from the trained HMM were identified as specific ones, illustrating its efficiency in revealing more detailed regulatory characteristics. The identified CSREs were enriched in different kinds of regulatory regions; their proximal genes were likely to participate in cell type-specific biological functions; and the expression of those genes was in line with the underlying histone codes of their proximal CSREs. All these results demonstrate the effectiveness of our method.
Compared with dCMA, CsreHMM not only locates CSREs for each cell type, but also identifies the mark combinations that characterize their specificity, reveals their sub-patterns and explicitly labels the CSREs shared by multiple cell types. These additional features provide a deeper understanding of CSREs. However, the limited number of states can only capture recurrent types of CSREs and consequently omits the rare ones, which can be picked up by carefully examining the histone codes of each CSRE provided by dCMA. Thus, CsreHMM is better suited to obtaining a general picture of CSREs that helps understand the specific behaviors of histone modifications in a cell type and the formation of cell identity.
Large epigenomic datasets usually contain cell types from the same lineage. When applied to such a dataset, CsreHMM is more likely to find lineage-specific regulatory elements than cell type-specific ones. Even though increasing the number of states would capture subtle differences between cell types within a lineage, and may uncover cell type-specific regulatory elements, this would also increase the training time quadratically. Instead, we suggest applying CsreHMM to the cell types of a specific lineage, which makes the comparison more reasonable and the cost much lower.
With the continuous generation of more genome-wide epigenomic data by large consortia such as IHEC, we expect this method to become a useful tool for investigating diverse chromatin modifications among multiple cell types or conditions.
We are grateful to the early effort of Miss Yiyi Yin on this project during her visit to our lab.
This work has been supported by the National Natural Science Foundation of China [11661141019, 61621003, 61422309, 61379092]; Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) [XDB13040600]; National Ten Thousand Talent Program for Young Top-notch Talents; Key Research Program of the Chinese Academy of Sciences [KFZD-SW-219]; National Key Research and Development Program of China [2017YFC0908405]; CAS Frontier Science Research Key Project for Top Young Scientist [QYZDB-SSW-SYS008]. Publication costs are funded by the National Natural Science Foundation of China [No. 11661141019].
Availability of data and materials
All data analysed during this study are included in this published article and its supplementary information files.
About this supplement
This article has been published as part of BMC Genomics Volume 19 Supplement 10, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-19-supplement-10.
C.W. and S.Z. designed the experiments, analysed the data, and wrote the paper. All authors have read and approved the manuscript.
Ethics approval and consent to participate
The data used in this study are accessed from the public database. No ethics approval is needed.
Consent for publication
The authors declare that they have no competing interests.
13. Zacher B, Lidschreiber M, Cramer P, Gagneur J, Tresch A. Annotation of genomics data using bidirectional hidden Markov models unveils variations in pol II transcription cycle. Mol Syst Biol. 2014;10:768.
28. Zhou P, Gu F, Zhang L, Akerberg BN, Ma Q, Li K, He A, Lin Z, Stevens SM, Zhou B, et al. Mapping cell type-specific transcriptional enhancers using high affinity, lineage-specific Ep300 bioChIP-seq. eLife. 2017;6:e22039.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.