InTAD: chromosome conformation guided analysis of enhancer target genes
High-throughput technologies for analyzing chromosome conformation at a genome scale have revealed that chromatin is organized in topologically associated domains (TADs). While TADs are relatively stable across cell types, intra-TAD activities are cell type specific. Epigenetic profiling of different tissues and cell types has identified a large number of non-coding epigenetic regulatory elements (‘enhancers’) that can be located far away from coding genes. Linear proximity is a commonly chosen criterion for associating enhancers with their potential target genes. While enhancers frequently regulate the closest gene, unambiguous identification of enhancer-regulated genes remains a challenge in the absence of sample-matched chromosome conformation data.
To associate enhancers with their target genes, we have previously developed and applied a method that tests for significant correlations between enhancer and gene expression across a cohort of samples. To limit the number of tests, we constrain this analysis to gene-enhancer pairs embedded in the same TAD, where information on TAD boundaries is borrowed from publicly available chromosome conformation capture (‘Hi-C’) data. We have now implemented this method as an R Bioconductor package ‘InTAD’ and verified the software package by reanalyzing available enhancer and gene expression data derived from ependymoma brain tumors.
The open-source package InTAD is an easy-to-use software tool for identifying proximal and distal enhancer target genes by leveraging information on correlated expression of enhancers and genes that are located in the same TAD. InTAD can be applied to any heterogeneous cohort of samples analyzed by a combination of gene expression and epigenetic profiling techniques and integrates either public or custom information on TAD boundaries.
Keywords: Epigenomics, Transcriptomics, Topologically associated domains, Enhancers
Abbreviations
EAG: Enhancer associated gene
EPN: Ependymoma brain tumor
RPKM: Reads per kilobase of transcript, per million mapped reads
TAD: Topologically associated domain
New technologies for analyzing the three-dimensional chromosome organization in a genome-wide manner have revealed mechanisms by which chromosome communication is established. By using different types of high-throughput techniques, such as ChIP-sequencing targeting different histone modifications, whole genome bisulfite sequencing, ATAC-sequencing, and DNase-Seq, many studies have discovered a large number of enhancers involved in gene regulation. Importantly, the analysis of active chromatin can uncover potential targets relevant for precision treatment of cancer. To associate enhancers with their target genes in the absence of sample-matched chromosome conformation data, several computational methods have been developed.
A widely used approach to associate enhancers with their target genes is to consider the closest genes along the linear DNA. For example, the R package ELMER uses 450K DNA methylation array data to first define enhancers based on hypo-methylated CpGs and then predicts enhancer target genes by computing the correlation between DNA methylation and gene expression, restricting the analysis to the 10 closest genes up- and downstream of the enhancer. Another example is TENET, an analytical approach that associates genome-wide expression changes of transcription factors with gain or loss in enhancer activities by correlating DNA methylation levels at enhancers with the gene expression of transcription factors. However, both tools require DNA methylation array data as input and restrict the correlation to the ‘closest genes’ or to transcription factors that regulate enhancers.
The 11-zinc finger DNA-binding protein CCCTC-binding factor (CTCF) plays an important role in chromatin organization. To improve the identification of gene-enhancer interactions, information on CTCF binding sites can be leveraged. The PreSTIGE method employs this strategy by accessing CTCF ChIP-seq data derived from 13 cell types. Here, CTCF binding sites are considered as insulators separating enhancers from their target genes. This method is currently available as an online application; however, its functionality is limited to the available reference data, and each sample is analyzed independently.
A fundamental concept of chromatin organization is that of topologically associated domains (TADs). TADs are segments of the genome that are characterized by frequent internal chromosome interactions and are insulated from adjacent TADs. It has been shown that mutations disturbing the integrity of TADs can lead to the activation of proto-oncogenes causing tumor development [8, 9].
Moreover, InTAD requires a predefined set of TAD regions as input. Since approximately 60–80% of TADs remain stable across cell types, the package comes with a set of TADs derived from IMR90 human fibroblast cell lines, which we have accessed in previous studies [10, 11, 12]. However, to take into account cell-type specific TAD boundaries, other Hi-C data can also be integrated by providing the resulting TAD regions as input in BED format.
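To illustrate what a TAD input in BED format looks like, the following Python sketch parses three-column BED lines (chromosome, start, end) into a sorted list of intervals. It is an illustration only and not part of InTAD, which is an R package; the function name `parse_tad_bed` and the example coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Tad(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)


def parse_tad_bed(text: str) -> List[Tad]:
    """Parse BED-formatted text into a sorted list of TAD intervals."""
    tads = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"invalid interval: {line!r}")
        tads.append(Tad(chrom, start, end))
    # Sort by chromosome name (as a string), then by start and end.
    return sorted(tads)


# Hypothetical coordinates, for illustration only.
example_bed = (
    "track name=exampleTADs\n"
    "chr22\t17400000\t18200000\n"
    "chr22\t16000000\t17000000\n"
    "chr1\t500000\t1200000\n"
)
tads = parse_tad_bed(example_bed)
```

Only the first three BED columns are read here; any additional columns, such as a name or score, are ignored.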
Various parameters control the further steps of the analysis workflow. Genes can optionally be filtered based on the analysis of their expression distribution or by selecting specific types of RNA. Further, enhancers and genes are combined when their genomic coordinates are embedded in the same TAD. Since the boundaries of TADs have been shown to be sensitive to the analytical method applied and may vary across cell types, genes that do not fall into a TAD are assigned to the nearest TAD by default. Subsequently, correlations between all enhancer-gene pairs within the same TAD are computed by selecting one of the supported methods: Pearson, Kendall or Spearman correlation. In addition, adjusted p-values can be calculated to control the false discovery rate using the R/Bioconductor package qvalue. The final result table includes detailed information about the computed correlation values, adjusted p-values, and Euclidean distances as an additional measure that helps to identify potential correlations that suffer from scale invariance.
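The pairing and correlation steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the InTAD implementation: regions are assigned to a TAD by their midpoint, only Pearson correlation is shown, p-value adjustment is omitted, and all names and values (`find_tad`, `correlate_within_tads`, the toy data) are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)
Feature = Tuple[Interval, Sequence[float]]  # region plus signal per sample


def find_tad(region: Interval, tads: List[Interval],
             nearest: bool = True) -> Optional[int]:
    """Return the index of the TAD containing the region midpoint.

    Assumption: if no TAD contains the midpoint and `nearest` is True,
    fall back to the closest TAD boundary on the same chromosome.
    """
    chrom, start, end = region
    mid = (start + end) // 2
    best, best_dist = None, math.inf
    for i, (t_chrom, t_start, t_end) in enumerate(tads):
        if t_chrom != chrom:
            continue
        if t_start <= mid < t_end:
            return i
        dist = min(abs(mid - t_start), abs(mid - t_end))
        if dist < best_dist:
            best, best_dist = i, dist
    return best if nearest else None


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for constant vectors
    return sxy / math.sqrt(sxx * syy)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlate_within_tads(enhancers: Dict[str, Feature],
                          genes: Dict[str, Feature],
                          tads: List[Interval]):
    """Correlate every enhancer with every gene assigned to the same TAD."""
    results = []
    for e_name, (e_region, e_vals) in enhancers.items():
        e_tad = find_tad(e_region, tads, nearest=False)
        if e_tad is None:
            continue
        for g_name, (g_region, g_vals) in genes.items():
            # Genes outside all TADs are assigned to the nearest TAD.
            if find_tad(g_region, tads, nearest=True) == e_tad:
                results.append((e_name, g_name,
                                pearson(e_vals, g_vals),
                                euclidean(e_vals, g_vals)))
    return results


# Toy data (hypothetical): one enhancer, three genes, four samples.
tads = [("chr1", 0, 1000), ("chr1", 1500, 3000)]
enhancers = {"enh1": (("chr1", 100, 200), [1.0, 2.0, 3.0, 4.0])}
genes = {
    "geneA": (("chr1", 600, 900), [2.0, 4.0, 6.0, 8.0]),    # same TAD
    "geneB": (("chr1", 1100, 1200), [4.0, 3.0, 2.0, 1.0]),  # between TADs
    "geneC": (("chr1", 2000, 2500), [1.0, 1.0, 2.0, 5.0]),  # other TAD
}
results = correlate_within_tads(enhancers, genes, tads)
```

In this toy example, `geneA` correlates perfectly with `enh1` although its values are twice as large, so the Euclidean distance is non-zero. This is the kind of scale difference that the additional distance column is meant to expose.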
The results can be visualized by simulated Hi-C maps highlighting significant correlations at selected genomic loci (Fig. 2b). Additionally, correlations between a selected gene and enhancer pair can be visualized with custom colors by providing annotations that reflect groups of samples (Fig. 2c).
Integration of TAD boundaries improves the identification of enhancer target genes
We have accessed H3K27ac ChIP-seq and RNA-seq data from our previous enhancer mapping study in ependymoma tumors and verified our previous results by repeating the analysis using our new InTAD software package.
To estimate how the fraction of identified enhancer associated genes depends on the number of samples, we have performed a saturation analysis using our cohort of n = 24 ependymoma tumors. In each iteration, ranging from n = 10 to n = 23, we randomly sampled the corresponding number of tumor samples, identified enhancer associated genes (EAGs) using our InTAD software, and compared the number of retrieved EAGs to the number of EAGs obtained when using the entire cohort of n = 24 ependymoma tumors. As a result, we observe a saturation of identified EAGs starting at approximately 16 samples, and more than ~95% of all EAGs were retained using at least 19 samples (Additional file 1: Figure S1A).
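The subsampling scheme of such a saturation analysis can be sketched as follows. This is a minimal Python illustration, not the code used in the study: `call_eags` stands in for a complete EAG detection run, the toy detector below is purely hypothetical, and measuring the overlap with the full-cohort EAG set (rather than comparing raw counts) as well as the number of repeats per sample size are assumptions.

```python
import random
from typing import Callable, Dict, Iterable, List, Sequence, Set


def saturation_curve(samples: Sequence[str],
                     call_eags: Callable[[List[str]], Set[str]],
                     sizes: Iterable[int],
                     n_repeats: int = 10,
                     seed: int = 0) -> Dict[int, float]:
    """Mean fraction of full-cohort EAGs recovered per subsample size."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    reference = call_eags(list(samples))
    if not reference:
        raise ValueError("no EAGs detected in the full cohort")
    curve = {}
    for n in sizes:
        fractions = []
        for _ in range(n_repeats):
            subset = rng.sample(list(samples), n)  # without replacement
            found = call_eags(subset)
            fractions.append(len(found & reference) / len(reference))
        curve[n] = sum(fractions) / len(fractions)
    return curve


def toy_call_eags(subset: List[str]) -> Set[str]:
    # Hypothetical stand-in: gene i is "detected" once the subset
    # contains at least i samples. A real run would recompute all
    # enhancer-gene correlations on the subset.
    return {f"gene{i}" for i in range(1, len(subset) + 1)}


cohort = [f"tumor{i}" for i in range(1, 25)]  # n = 24
curve = saturation_curve(cohort, toy_call_eags, sizes=range(10, 24))
```

With a real detector in place of `toy_call_eags`, plotting `curve` against the sample size gives a saturation plot of the kind referred to above.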
We were also interested in comparing the results of our enhancer-gene correlation method with results obtained when linking enhancers with the closest genes. Therefore, we have annotated the ependymoma enhancers with the 2 to 10 closest genes located upstream and downstream of the enhancers. Applying an adjusted p-value threshold of 0.01 to our original InTAD correlation analysis, we compared enhancer associated genes detected by both methods (Fig. 3b). As a result, we observe that more than 50% of potential enhancer target genes are missed by the closest gene annotation, even though they are located in the same TAD and their gene expression is significantly correlated with the expression of enhancer elements. Notably, up to 75% of enhancer associated genes annotated by the closest gene approach are also identified by our correlation strategy. The majority (>99%) of enhancer target genes that are only annotated by the closest gene approach are not located in the same TAD as the enhancer, rendering them likely false positives.
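The comparison between the two annotation strategies amounts to a set comparison per enhancer, which the following Python sketch illustrates. It assumes a single chromosome and represents each gene by a single position; the function names, positions, and the set of correlated genes are hypothetical and not taken from the study.

```python
from typing import Dict, Set


def closest_genes(enhancer_pos: int, gene_pos: Dict[str, int],
                  k: int) -> Set[str]:
    """Return the k nearest genes upstream and the k nearest downstream."""
    upstream = sorted(
        (g for g in gene_pos if gene_pos[g] < enhancer_pos),
        key=lambda g: enhancer_pos - gene_pos[g],
    )[:k]
    downstream = sorted(
        (g for g in gene_pos if gene_pos[g] >= enhancer_pos),
        key=lambda g: gene_pos[g] - enhancer_pos,
    )[:k]
    return set(upstream) | set(downstream)


def compare_annotations(closest: Set[str],
                        correlated: Set[str]) -> Dict[str, Set[str]]:
    """Split target genes by which strategy reports them."""
    return {
        "both": closest & correlated,
        "closest_only": closest - correlated,
        "correlated_only": correlated - closest,
    }


# Hypothetical gene positions around an enhancer at position 1000.
gene_pos = {"g1": 100, "g2": 400, "g3": 900, "g4": 1300, "g5": 2500}
closest = closest_genes(1000, gene_pos, k=1)
# Hypothetical genes passing the correlation threshold within the TAD.
correlated = {"g2", "g3"}
overlap = compare_annotations(closest, correlated)
```

Here `g2` is a correlated gene that the closest gene annotation misses with k = 1, while `g4` is reported only by proximity.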
The inclusion of genes outside TADs increases the sensitivity in detecting enhancer target genes
Among other findings, this novel approach revealed a strong enhancer element potentially regulating the transcription factor SOX10. SOX10 functions in neural crest and oligodendrocyte development and has previously been described controversially as a negative marker for the diagnosis of ependymoma tumors [16, 17]. Based on our re-analysis of the available gene expression and enhancer data across six intracranial ependymoma subgroups, we find that SOX10 is specifically expressed in the subgroup PF-EPN-A (Fig. 4b), likely regulated by a subgroup-specific enhancer element located ~40 kbp upstream of the gene. These results indicate a tumor-specific chromosome conformation that potentially allows interactions between the PF-EPN-A specific enhancer element and the SOX10 gene. This example demonstrates the importance of the new functionality that allows the use of empty regions between TADs, especially when accessing reference chromosome conformation data obtained from unrelated cell types.
TADs derived from related cell-types improve the identification of EAGs
The discovery of TADs revealed global levels of stability of chromatin organization across cell types. However, recent studies show that up to 40% of TADs can differ between different tissues and organs. Moreover, it has been shown that different computational methods for the analysis of TADs largely result in different numbers and lengths of TADs for the same data set [18, 19]. To further investigate the impact of the chosen reference chromosome conformation data, we repeated our analysis by using TADs obtained from cerebellum astrocytes provided by the ENCODE project. We selected this cell type since it is expected to be more similar to brain tumors than the previously accessed IMR90 cells. The total number of TADs and their mean length appeared to be largely similar between IMR90 and cerebellum astrocytes (Additional file 2: Figure S2A). The majority of EAGs (~75%) can be identified with either of the two sets of TADs. However, with TADs obtained from cerebellum astrocytes, we identify noticeably more EAGs than with TADs derived from IMR90 cells (7746 vs 6658, Additional file 2: Figure S2B). Moreover, by considering TADs from cerebellum astrocytes, we can identify additional known ependymoma marker genes as EAGs, such as SOX10, due to their co-location with enhancer elements in the same TAD. Importantly, correlations are on average higher between genes and enhancers co-located in TADs that are common to IMR90 and cerebellum astrocytes (Additional file 2: Figure S2C). Similarly, correlations are generally higher in TADs specific to cerebellum astrocytes than in TADs specific to IMR90 cells, providing additional evidence for the relevance of choosing Hi-C data derived from related cell types.
In this study we present a novel R/Bioconductor package, InTAD, that identifies enhancer associated genes within and across TADs using epigenetic and transcriptomic data. In comparison to other existing tools, InTAD supports different input data types and overcomes the limits of the “closest gene” strategy by integrating information on TADs obtained from public or custom chromosome conformation analysis experiments. We have employed InTAD for the re-analysis of H3K27ac ChIP-seq and RNA-seq data obtained from 24 ependymoma brain tumors. Additionally, by performing simulation tests that compare against randomly placed TADs, we confirmed the benefit of using TADs to identify enhancer associated genes. It is important to note that the choice of a specific set of TADs will have an impact on the resulting number of enhancer target genes. If cell-type matched Hi-C data is unavailable, we recommend using other publicly available TADs and adjusting the InTAD parameters to allow for the inclusion of genes outside TADs in order to increase the sensitivity. Moreover, there exist different analysis strategies and methods for calling TADs, and the commonalities and differences of these tools are still under debate in the field [18, 19]. The package also includes other options to control the sensitivity of the workflow, such as filtering out lowly expressed genes, calculation of the Euclidean distance, and computation of adjusted p-values. In addition, InTAD can generate plots that show predicted chromosome conformation based on enhancer-gene correlations. We expect that InTAD will have a positive impact on future enhancer profiling studies focused on the identification and prioritization of oncogenes or important regulators of cell-type identity in health and disease.
Availability and requirements
Project name: InTAD.
Project home page: https://github.com/kokonech/InTAD
Operating system(s): platform independent.
Programming language: R.
Other requirements: R 3.5.0 or higher, Bioconductor 3.7 or higher.
License: GNU GPL v2.
Any restrictions to use by non-academics: none.
We are thankful to Venu Thatikonda, Michael Fletcher and Clarence Mah for testing the InTAD software package and for providing important feedback.
The project is funded by Hopp-Children’s Cancer Center at the NCT Heidelberg (KiTZ) and German Cancer Research Center (DKFZ). Additional support is provided by the UC San Diego Moores Cancer Center. The funders had no role in the design of the study, the collection, analysis and interpretation of data and in writing the manuscript.
Availability of data and material
The ependymoma brain tumor data analyzed in the study is available in the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/home) under accession number EGAS00001002696. The cerebellum astrocytes Hi-C dataset with computed TADs was accessed from ENCODE project experiment ENCSR011GNI.
KO implemented the R package and performed all data analyses. SE conceived the correlation analysis and provided custom scripts. KO and LC wrote the manuscript. JOK, SMP and LC contributed to the design of the study and interpreted the results. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 15. Bass AJ, Dabney A, Robinson D. qvalue: Q-value estimation for false discovery rate control. R package version 2.14.0. 2018. http://github.com/jdstorey/qvalue.
- 16. Švajdler M, Rychly B, Mezencev R, Frohlichova L, Bednarova A, Pataky F, et al. SOX10 and Olig2 as negative markers for the diagnosis of ependymomas: an immunohistochemical study of 98 glial tumors. Histol Histopathol. 2015;19:11654.
- 17. Kleinschmidt-DeMasters BK, Donson AM, Richmond AM, Pekmezci M, Tihan T, Foreman NK. SOX10 distinguishes pilocytic and pilomyxoid astrocytomas from ependymomas but shows no differences in expression level in ependymomas from infants versus older children or among molecular subgroups. J Neuropathol Exp Neurol. 2016;75(4):295–8.
- 20. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.