TADKB: Family classification and a knowledge base of topologically associating domains
Topologically associating domains (TADs) are considered the structural and functional units of the genome. However, there is a lack of an integrated resource for TADs in the literature where researchers can obtain family classifications and detailed information about TADs.
We built an online knowledge base TADKB integrating knowledge for TADs in eleven cell types of human and mouse. For each TAD, TADKB provides the predicted three-dimensional (3D) structures of chromosomes and TADs, and detailed annotations about the protein-coding genes and long non-coding RNAs (lncRNAs) existent in each TAD. Besides the 3D chromosomal structures inferred by population Hi-C, the single-cell haplotype-resolved chromosomal 3D structures of 17 GM12878 cells are also integrated in TADKB. A user can submit query gene/lncRNA ID/sequence to search for the TAD(s) that contain(s) the query gene or lncRNA. We also classified TADs into families. To achieve that, we used the TM-scores between reconstructed 3D structures of TADs as structural similarities and the Pearson’s correlation coefficients between the fold enrichment of chromatin states as functional similarities. All of the TADs in one cell type were clustered based on structural and functional similarities respectively using the spectral clustering algorithm with various predefined numbers of clusters. We have compared the overlapping TADs from structural and functional clusters and found that most of the TADs in the functional clusters with depleted chromatin states are clustered into one or two structural clusters. This novel finding indicates a connection between the 3D structures of TADs and their DNA functions in terms of chromatin states.
TADKB is available at http://dna.cs.miami.edu/TADKB/.
Keywords: Topologically associating domains, TADs, Family classification, Single-cell 3D genome structures, Long non-coding RNAs, lncRNAs
Abbreviations
lncRNA: Long non-coding RNA
TAD: Topologically associating domain
Topologically associating domains (TADs) are DNA segments that are considered the structural and functional units of mammalian genomes [1, 2]. The length of TADs varies from hundreds of kilobases to a few million bases. The boundaries of TADs are enriched with different factors, including the insulator-binding protein CTCF and housekeeping genes. TADs pervade the whole genome, remain consistent across different cell types, and are highly conserved between humans and mice. Recently, TADs have been widely considered the unit of chromosome organization and are being studied together with genes, CTCF, cohesin, and chromatin loops [2, 4, 5]. Many methods have been developed to detect topologically associating domains [1, 2, 6, 7, 8, 9, 10, 11, 12, 13]. Most of them are based on the finding that Hi-C contacts within a TAD are markedly more frequent than those between two different domains, which is the fundamental rule for defining domain locations in mammalian chromosomes.
Hi-C experiments capture the genome-wide proximity relationships between genomic locations based on millions of cells. The resolution of Hi-C experiments has improved greatly, from 1 Mb originally to 1 kb more recently. This high resolution makes it possible to detect enough Hi-C contacts within a TAD or to detect genome-wide loops. For example, one study identified about 10,000 loops, which often indicate promoter-enhancer interactions that are closely related to gene regulation. Studies have also found that loops are usually conserved between different cell types and species [2, 15].
The availability of high-resolution Hi-C contacts also makes it possible to reconstruct the three-dimensional (3D) structure of chromosomes. The Hi-C contact data indicate the proximity relationship between two genomic locations; given enough of these contacts, computational algorithms can construct a 3D structure that satisfies them. The early work by Duan et al. constructed the 3D structure of the yeast genome based on a 4C-related experiment (4C is a type of chromosome conformation capture experiment designed before the invention of Hi-C). ChromSDE uses semi-definite programming to construct 3D models, whereas Trieu et al. applied optimization after obtaining the in-contact and not-in-contact relationships for bead pairs. PASTIS uses metric multidimensional scaling to construct 3D structures, which first calculates a wish distance between every pair of beads (a chromosome is evenly divided into beads of the same length). This wish distance is calculated directly from the number of Hi-C contacts by $d \sim c^{-1/3}$ (where $d$ is the wish distance and $c$ is the number of Hi-C contacts), so that a higher number of Hi-C contacts indicates a shorter wish distance. The multidimensional scaling algorithm tries to find a 3D structure that best satisfies all the wish distances.
The conversion formula $d \sim c^{-1/3}$ has a drawback: when $c$ is larger than 10, the converted distances converge to very small values. To overcome this drawback, instead of using the same parameter (1/3) for all Hi-C contacts, we defined a novel type of complex network based on Hi-C contacts and assigned a conversion parameter to each pair of Hi-C contacts based on their affinity to their neighbors, from which we further inferred the wish distance for each bead pair. Based on the bead-pair-specific wish distances, we reconstructed the 3D structures of chromosomes and TADs at 40 kb resolution. Although this technique was not used in TADKB, it is worth mentioning for a broad review of the algorithms used to reconstruct genome 3D structures.
Given a distance matrix, reconstructing a 3D structure can be considered a dimensionality reduction problem. Generally speaking, the methods for this task can be classified into linear methods (e.g., principal component analysis) and non-linear methods (e.g., multi-dimensional scaling and t-distributed stochastic neighbor embedding). Non-linear methods are more complicated than linear ones and can capture non-linear relationships in the input data. Among the non-linear methods, t-distributed stochastic neighbor embedding (t-SNE) uses Gaussian joint probabilities to represent affinities in the original space and Student’s t-distributions to represent affinities in the embedded space. It has been claimed that the t-SNE method has advantages such as being able to reveal structures at different scales. Therefore, it can be used to capture and reconstruct local structures from single-cell Hi-C contact matrices [23, 24].
Long non-coding RNA (lncRNA) is defined as a transcript of > 200 nucleotides that is not translated into protein. It has been found that > 74% of the human genome is transcribed to RNA; however, only 2% of the transcripts are finally translated into proteins. Therefore, non-coding RNAs account for a large portion of the human genome and were long considered “junk”. Only recently has a growing body of research confirmed the functions of lncRNAs in gene expression regulation [26, 27], epigenetic modification [28, 29, 30], and the control of chromatin structure. For example, Xist is a lncRNA whose gene locus is located on the X chromosome of mammalian cells. Its important function is to inactivate one copy of the X chromosome in female cells. Every diploid wild-type female mammalian cell has two copies of the X chromosome. To balance the amount of gene expression, that is, to perform “dosage compensation”, one of the two copies is inactivated: it adopts a highly compacted structure and is silenced in terms of gene expression. This inactivation is carried out by Xist lncRNAs, which alter the 3D structure of the X chromosome and eventually inactivate one copy of it. There are multiple databases for lncRNAs, such as NONCODE 2016, LNCipedia 4.0, and lncRNAdb 2.0. However, different lncRNA databases have different naming standards, so the same lncRNA can have different IDs in different databases.
We built the topologically associating domain knowledge base (TADKB), a knowledge base for TADs integrated with annotations of protein-coding genes and lncRNAs. TADKB defines TAD families based on the TADs shared between two types of clusters: (1) structural clusters based on 3D structural similarities and (2) chromatin-state clusters based on the similarities of chromatin-state fold enrichment. Moreover, TADKB unifies three lncRNA databases, allowing users to cross-reference between them when they have different IDs for the same lncRNA.
Construction and content
TADKB provides the TADs called from eleven cell types: GM12878, HMEC, NHEK, IMR90, KBM7, K562, and HUVEC for human, and CH12-LX, ES, NPC, and CN for mouse. The normalized Hi-C contact matrices were downloaded from the Gene Expression Omnibus (GEO) with ID GSE63525 for the first eight cell types at the resolutions of 50 kb and 10 kb and GEO GSE96107 for the last three cell types at the resolutions of 50 kb and 10 kb. The TAD locations for all of the cell types were detected using three different methods: (1) Directionality Index (DI), (2) Gaussian Mixture model And Proportion test (GMAP), and (3) Insulation Score (IS). For IS, we first combined the overlapping boundary regions and then called domains between two successive boundaries. We also used data from two Hi-C variants, HiChIP and SPRITE, each of which provides high-resolution chromatin contact data for two cell lines, GM12878 and mES. The details of the domain-detection results are shown in Additional file 1: Table S1. Hi-C data are normalized using KR [2, 41], whereas HiChIP and SPRITE data are normalized using Hi-Corrector with 100 iterations. All TAD annotations described in Additional file 1: Table S1 can be downloaded from TADKB’s download webpage.
Because the scale of Hi-C contacts varies widely and the contact-to-distance conversion formula $d = (1/c)^{1/3}$ is sensitive to the scale of the number of Hi-C contacts, we first rescaled the Hi-C contacts of each TAD to the range [1, 30] via a linear transformation that ignores missing Hi-C values. We then used the formula $d = (1/c)^{1/3}$ to convert Hi-C contacts ($c$) into wish distances ($d$). We reconstructed each TAD’s 3D structure using two manifold learning methods, metric multidimensional scaling (MDS) and t-distributed stochastic neighbor embedding (t-SNE), both implemented in Scikit-learn, by reducing the dimensionality to three components. We found that the 3D structures of TADs reconstructed using t-SNE are very sensitive to two parameters (i.e., perplexity and learning rate). Therefore, we generated multiple 3D structures for each TAD using t-SNE with different configurations of the two parameters, superimposed these structures on the one predicted by the MDS method, and selected the structure with the minimum root-mean-square deviation (RMSD) as the final structure from t-SNE.
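The rescaling, conversion, and embedding steps described above can be sketched as follows. This is a minimal illustration, not the TADKB pipeline: it assumes missing values are stored as NaN, uses a made-up toy contact matrix, and swaps in classical (Torgerson) MDS written in NumPy as a stand-in for the iterative metric MDS from Scikit-learn. The function names are hypothetical.

```python
import numpy as np


def rescale_contacts(contacts, low=1.0, high=30.0):
    """Linearly map the non-missing contacts of one TAD onto [low, high]."""
    c = np.asarray(contacts, dtype=float)
    valid = ~np.isnan(c)  # assumption: missing Hi-C values are stored as NaN
    c_min, c_max = c[valid].min(), c[valid].max()
    if c_max == c_min:
        raise ValueError("contacts are constant; linear rescaling is undefined")
    scaled = np.full_like(c, np.nan)
    scaled[valid] = low + (c[valid] - c_min) * (high - low) / (c_max - c_min)
    return scaled


def wish_distances(scaled_contacts):
    """Convert contacts c into wish distances d = (1/c)^(1/3)."""
    d = (1.0 / scaled_contacts) ** (1.0 / 3.0)
    np.fill_diagonal(d, 0.0)  # a bead is at distance zero from itself
    return d


def classical_mds(dist, n_components=3):
    """Classical MDS: embed a complete distance matrix into n_components dims."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(gram)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    lam = np.clip(eigvals[order], 0.0, None)           # drop negative noise
    return eigvecs[:, order] * np.sqrt(lam)


# Toy TAD with six beads: contacts decay with genomic separation (made-up data).
idx = np.arange(6)
separation = np.abs(idx[:, None] - idx[None, :])
toy_contacts = 200.0 / (1.0 + separation)

scaled = rescale_contacts(toy_contacts)
dist = wish_distances(scaled)
coords = classical_mds(dist)
print(coords.shape)  # (6, 3)
```

After rescaling, the wish distances between different beads fall between $30^{-1/3} \approx 0.32$ and 1, which keeps the embedding from being dominated by the raw scale of the contact counts. The embedding step in this sketch requires a complete distance matrix, so any NaN entries would have to be handled before calling it.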
We evaluated the reconstructed 3D structures using the correlation between the exponent parameter (measuring the contact probability against genomic distance based on Hi-C contact maps; see the definition in Additional file 1) and the radius of gyration (measuring the compactness of reconstructed 3D structures), as described in our previous work. Because a better reconstructed 3D structure should show high consistency between the 2D structural characteristics represented by the exponent parameter and the 3D compactness represented by the radius of gyration, we calculated the correlations between all TADs’ exponent parameters and radii of gyration for MDS- and t-SNE-inferred structures in GM12878. The Pearson’s and Spearman correlation coefficients between contact-probability-based exponent parameters and MDS-based radii of gyration are −0.71 (P-value < 2.2e-16) and −0.77 (P-value < 2.2e-16), respectively, whereas the corresponding correlations for t-SNE-based radii of gyration are −0.08 (P-value = 8.2e-06) and −0.02 (P-value = 0.2487). These results indicate that the structures inferred by MDS are more consistent with the Hi-C contact maps than the structures inferred by t-SNE. Therefore, we used MDS-based structures in the downstream analysis. The 3D structures of the chromosomes and TADs were inferred using the same method.
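As a rough illustration of the evaluation above, the sketch below computes the radius of gyration of a bead structure and the Pearson and Spearman correlations between two per-TAD quantities. The helper names are hypothetical, the rank-based Spearman helper assumes no tied values, and the toy data simply scale one random structure to show the helpers working. Computing the exponent parameter itself is not shown, since its definition is given only in Additional file 1.

```python
import numpy as np


def radius_of_gyration(coords):
    """Root-mean-square distance of the beads from their centroid."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation computed on ranks (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


# Toy check: scaling a structure scales its radius of gyration by the same factor.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 3))
scales = [1.0, 2.0, 4.0]
rg_values = [radius_of_gyration(base * s) for s in scales]

print(pearson(scales, rg_values), spearman(scales, rg_values))
```

In the setting described above, the two inputs to `pearson` and `spearman` would be the per-TAD exponent parameters and the per-TAD radii of gyration of the MDS- or t-SNE-inferred structures.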
We used our in-house tool named SCL (manuscript submitted) to reconstruct the 3D structures of chromosomes based on single-cell Hi-C data. The single-cell haplotype-resolved chromosomal 3D structures at 40 kb resolution of 17 GM12878 cells were generated based on the single-cell Hi-C data released from . For chromosomes 10 and 19 of cell 1, chromosomes 1, 2, 4, and 11 of cell 4, all chromosomes of cell 8, and chromosome 6 of cell 10, the raw single-cell Hi-C contacts (file name *.raw.con.txt.gz) were used to infer the 3D structures. For all other chromosomes and cells, the single-cell Hi-C contacts after imputation were used (file name *.impute3.round4.con.txt.gz). All single-cell Hi-C data were downloaded from .
After obtaining the reconstructed 3D structures, we used 3D structure alignment tools to compare the structural similarity between any given two TADs. In this study, we used TM-align  to superimpose two TADs’ structures and obtained the TM-score as the structural similarity score normalized by the length of the smaller TAD. Therefore, given the reconstructed 3D structures of all TADs in a genome we used TM-score to generate a structural similarity matrix.
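To make the similarity score concrete, the sketch below evaluates the TM-score formula, $\frac{1}{L}\sum_i \frac{1}{1 + (d_i/d_0)^2}$, for two structures that are already superimposed and matched bead to bead. It is not TM-align: TM-align also searches for the alignment and superposition that maximize this score, and that search is omitted here. The function name is hypothetical, and the distance scale `d0` is left as an explicit parameter because the value appropriate for TAD structures is not stated in the text.

```python
import numpy as np


def tm_score_superimposed(coords_a, coords_b, d0, norm_length=None):
    """TM-score of two pre-superimposed structures with matched beads.

    coords_a, coords_b: (n, 3) arrays whose i-th rows are the paired beads.
    d0: distance scale (assumption: supplied by the caller).
    norm_length: length used for normalization; defaults to the number of pairs.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    pair_dist = np.linalg.norm(a - b, axis=1)
    if norm_length is None:
        norm_length = len(a)
    return float(np.sum(1.0 / (1.0 + (pair_dist / d0) ** 2)) / norm_length)


# Toy structures: identical beads score 1; beads all displaced by d0 score 0.5.
structure = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
shifted = structure + np.array([0.0, 2.0, 0.0])

same_score = tm_score_superimposed(structure, structure, d0=2.0)
shifted_score = tm_score_superimposed(structure, shifted, d0=2.0)
print(same_score, shifted_score)
```

Passing the length of the smaller TAD as `norm_length` mirrors the normalization described above; repeating the calculation for every pair of TADs would fill the structural similarity matrix.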
We next used chromatin-state annotation  to explore the chromatin-state similarity between any two TADs. We downloaded the 25-state annotations from the roadmap epigenomics project  for six cell types including GM12878, HMEC, HUVEC, IMR90, K562, and NHEK. The 25 states are (1) active TSS, (2) promoter upstream TSS, (3) promoter downstream TSS 1, (4) promoter downstream TSS 2, (5) transcribed-5′ preferential, (6) strong transcription, (7) transcribed-3′ preferential, (8) weak transcription, (9) transcribed & regulatory (Prom/Enh), (10) transcribed 5′ preferential and Enh, (11) transcribed 3′ preferential and Enh, (12) transcribed and weak Enhancer, (13) active enhancer 1, (14) active enhancer 2, (15) active enhancer flank, (16) weak enhancer 1, (17) weak enhancer 2, (18) primary H3K27ac possible Enhancer, (19) primary DNase, (20) ZNF genes & repeats, (21) heterochromatin, (22) poised promoter, (23) bivalent promoter, (24) repressed polycomb, and (25) quiescent/low. For each TAD in each of the six cell types with available chromatin-state annotations, we computed its fold enrichment of each state using the OverlapEnrichment function in ChromHMM . Given any two TADs in a cell type, we calculated the Pearson’s correlation coefficient between their fold enrichment values and treated the absolute value of the correlation as the chromatin-state similarity score. In this way, we generated a functional similarity matrix for each cell type.
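The chromatin-state similarity described above reduces to an absolute Pearson correlation between rows of a fold-enrichment table. The sketch below assumes the table has already been computed (for example with ChromHMM) and loaded as an array with one row per TAD and one column per state. The function name and the toy values, which use four states instead of 25, are made up for illustration.

```python
import numpy as np


def chromatin_state_similarity(fold_enrichment):
    """Absolute Pearson correlation between every pair of TAD rows.

    fold_enrichment: (n_tads, n_states) array of fold-enrichment values.
    A TAD whose enrichment is constant across states has undefined
    correlation, so its row and column come out as NaN.
    """
    fe = np.asarray(fold_enrichment, dtype=float)
    return np.abs(np.corrcoef(fe))  # np.corrcoef treats each row as a variable


# Toy table: 4 TADs x 4 states (the paper uses 25 states per TAD).
toy_enrichment = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # proportional to TAD 0
    [4.0, 3.0, 2.0, 1.0],   # reversed profile of TAD 0
    [1.0, 3.0, 2.0, 4.0],
])
similarity = chromatin_state_similarity(toy_enrichment)
print(np.round(similarity, 2))
```

Because the absolute value is taken, a TAD with the reversed profile of another scores as highly similar to it, exactly as a proportional profile does.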
After that, we clustered TADs based on their similarities at the structural and chromatin-state aspects. We used Spectral Clustering  implemented in Scikit-learn  as the clustering algorithm as it outperforms the other algorithms (e.g., Affinity Propagation ) when dealing with non-convex clusters.
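To show how a precomputed similarity matrix drives spectral clustering, the sketch below implements only the two-cluster special case in NumPy: it splits the items by the sign of the second eigenvector of the normalized graph Laplacian, in the spirit of a normalized cut. This is a simplified stand-in, not the procedure used for TADKB, which relied on Scikit-learn's `SpectralClustering` with several predefined numbers of clusters. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def spectral_bipartition(similarity):
    """Split items into two clusters from a symmetric similarity matrix.

    Assumes non-negative similarities and a connected similarity graph
    (every item has some similarity to the rest).
    """
    w = np.asarray(similarity, dtype=float)
    degree = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^(-1/2) W D^(-1/2)
    laplacian = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # ascending eigenvalues
    fiedler = d_inv_sqrt * eigvecs[:, 1]           # second-smallest eigenvector
    return (fiedler > 0).astype(int)


# Toy similarity matrix: two groups of three items, weakly linked to each other.
toy_similarity = np.full((6, 6), 0.05)
toy_similarity[:3, :3] = 0.9
toy_similarity[3:, 3:] = 0.9
np.fill_diagonal(toy_similarity, 1.0)

labels = spectral_bipartition(toy_similarity)
print(labels)
```

The two label values are arbitrary: which group receives 0 and which receives 1 depends on the sign of the eigenvector returned by the solver, so only the grouping itself is meaningful.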
We downloaded protein-coding gene annotations from Ensembl and lncRNA annotations from NONCODE 2016, LNCipedia 4.0, and lncRNAdb 2.0. Because we used hg19 and mm9 as reference genomes when identifying domain locations, gene annotations based on other assemblies were first converted to hg19 (human) or mm9 (mouse) coordinates using liftOver. We mapped genes onto TADs for each of the eleven cell types by comparing their genomic positions. For example, if a lncRNA’s genomic position overlaps a TAD’s genomic interval (i.e., the region between its start and end positions), we assigned this lncRNA to this TAD. The sequence search function was implemented based on BLAST.
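The overlap rule used for mapping genes and lncRNAs onto TADs can be sketched as a simple interval test. The sketch assumes closed intervals on the same chromosome and the same reference assembly; the record layout, names, and coordinates are made up for illustration.

```python
from collections import namedtuple

# Hypothetical record type: a named genomic interval with closed coordinates.
Interval = namedtuple("Interval", ["name", "chrom", "start", "end"])


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def map_genes_to_tads(genes, tads):
    """Return {tad name: [names of genes overlapping that TAD]}."""
    mapping = {tad.name: [] for tad in tads}
    for gene in genes:
        for tad in tads:
            if overlaps(gene, tad):
                mapping[tad.name].append(gene.name)
    return mapping


tads = [
    Interval("TAD_1", "chr1", 100_000, 500_000),
    Interval("TAD_2", "chr1", 500_001, 900_000),
]
genes = [
    Interval("geneA", "chr1", 120_000, 130_000),   # inside TAD_1
    Interval("lncB", "chr1", 490_000, 510_000),    # spans the TAD_1/TAD_2 border
    Interval("geneC", "chr2", 120_000, 130_000),   # different chromosome
]

mapping = map_genes_to_tads(genes, tads)
print(mapping)  # {'TAD_1': ['geneA', 'lncB'], 'TAD_2': ['lncB']}
```

Under this rule a gene that straddles a boundary is assigned to both neighboring TADs, which is consistent with a search returning the "TAD(s)" that contain a query gene. The nested loop is adequate for a sketch; sorted intervals or an interval tree would be more efficient for genome-wide annotation sets.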
Utility and discussion
TADKB has the following main components: browse, family view, acrossCells, search, and download. Each component is described below.
TAD family component
We compared the chromatin-state clusters with the 3D structural clusters and found overlapping TADs, that is, TADs that occur in both types of clusters. The example shown in Fig. 9(c) uses 20 chromatin-state clusters and 5 structural clusters. We then normalized the number of overlapping TADs by the sizes of the two clusters to obtain the overlapping TAD enrichment, which is insensitive to cluster size. For example, the number of overlapping TADs between structural cluster 1 and chromatin-state cluster 1 is 18 (see Fig. 9(c)). This value was divided by 167 (the size of chromatin-state cluster 1) and further divided by 338 (the size of structural cluster 1), which gives approximately 0.00032; multiplying by 1000 for better visualization gives the final value of 0.3 (see the bottom-left value in Fig. 9(b)).
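The normalization in the worked example can be written out as a short calculation. The function names and the toy cluster labels below are hypothetical; only the numbers 18, 167, 338, and the factor 1000 come from the text.

```python
from collections import Counter


def overlap_enrichment(n_overlap, state_cluster_size, struct_cluster_size, scale=1000):
    """Overlap count normalized by both cluster sizes, scaled for display."""
    return n_overlap / state_cluster_size / struct_cluster_size * scale


def enrichment_table(state_labels, struct_labels, scale=1000):
    """Enrichment for every (chromatin-state cluster, structural cluster) pair.

    state_labels[i] and struct_labels[i] are the two cluster labels of TAD i.
    Pairs with no shared TAD are simply absent from the returned dict.
    """
    pair_counts = Counter(zip(state_labels, struct_labels))
    state_sizes = Counter(state_labels)
    struct_sizes = Counter(struct_labels)
    return {
        (s, t): overlap_enrichment(n, state_sizes[s], struct_sizes[t], scale)
        for (s, t), n in pair_counts.items()
    }


# Worked example from the text: 18 shared TADs, cluster sizes 167 and 338.
example_value = overlap_enrichment(18, 167, 338)
print(round(example_value, 1))  # 0.3

# Toy labels for four TADs (made-up data).
toy_table = enrichment_table([0, 0, 1, 1], [0, 1, 1, 1])
```

Dividing by both cluster sizes keeps a large cluster from appearing enriched merely because it contains many TADs.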
From Fig. 9(a) and (b), we observed that most of the TADs in the chromatin-state clusters that are depleted of chromatin states (e.g., 2, 12, and 20) can be found in the second and third structural clusters, especially in the third structural cluster. We tested all the possible number-of-cluster combination configurations (i.e., select one from 10, 20, and 30 as the number of chromatin-state clusters, and select one from 2, 3, 5, and 10 as the number of structural clusters) for TADs detected by DI at 50 kb resolution of the six human cell types, including GM12878, HMEC, HUVEC, IMR90, K562, and NHEK (6 × 3 × 4 = 72 heat-maps; all can be downloaded from the TADKB website) and observed the same patterns, that is, most of the TADs in the chromatin-state clusters that are depleted of chromatin states can be found in one or two structural clusters, indicating that this observation does not occur by accident. This observation may provide a novel way to connect TADs’ 3D structures with DNA functions indicated by chromatin states.
We plotted the distributions of the exponent parameters and radii of gyration of the mutual TADs shared by (1) structural cluster 3 and chromatin-state cluster 2 (86 mutual TADs) and (2) structural cluster 3 and chromatin-state cluster 12 (23 mutual TADs) (Fig. 9(d) and (e)). Compared with the other mutual sets, the TADs in these two mutual sets have smaller exponent parameters and larger radii of gyration, which may indicate that these TADs have a less compact 3D structure; they all have depleted chromatin-state enrichment. We also plotted the gene density distribution (Fig. 9(f)), which shows that the TADs in these two mutual sets have noticeably lower gene density.
We next explored whether our observations result from heterochromatin or gene deserts. First, we downloaded the gap table for hg19 from the UCSC Genome Table Browser, compared the heterochromatin and centromere gaps with the 2773 TADs from GM12878, and found that (1) only 15 TADs (see Additional file 1: Table S2 for details of the 15 TADs) overlap with heterochromatin or centromere regions; and (2) only three of the 15 TADs belong to the two structural clusters (clusters 2 and 3 in Fig. 9) with depleted chromatin-state enrichment. Therefore, we think our observations are not related to heterochromatin or centromeres. Second, from Fig. 9 we can observe that most of the TADs have positive gene densities, indicating that most of the TADs do not belong to gene deserts. Therefore, we think our observations may not be related to gene deserts either.
We listed the chromosomes, coordinates, exponent parameters, and radii of gyration of the TADs in the overlapping sets between (1) chromatin-state cluster 2 and structural cluster 3 (Additional file 1: Table S3), (2) chromatin-state cluster 12 and structural cluster 3 (Additional file 1: Table S4), and (3) chromatin-state cluster 20 and structural cluster 3 (Additional file 1: Table S5). We gathered the coding genes located in the TADs in these three sets and ran a GO enrichment test using AmiGO2 (http://amigo.geneontology.org/rte). The enriched GO terms in the biological process ontology (BPO), cellular component ontology (CCO), and molecular function ontology (MFO) are also listed in the captions of the corresponding tables in Additional file 1.
TAD search component
The download component allows users to download 90 TAD annotation files as described in Additional file 1: Table S1 and 72 heatmaps about the overlapping TAD analysis between chromatin-state and structural clusters.
TADKB, a database for topologically associating domains, integrates the 2D and 3D structures of TADs, the 3D structures of chromosomes, annotations of coding genes and lncRNAs, loops or peaks, and family classifications of TADs. TADKB allows users to view the genomic locations of coding genes, lncRNAs, and loops on the 3D structure of a TAD. It also integrates three major lncRNA databases so that the different IDs from different lncRNA databases can be unified. The TAD families in TADKB are defined as the overlapping TADs found in chromatin-state and structural clusters. We also found that most of the TADs in depleted chromatin-state clusters fall into one or two structural clusters, and that these TADs mostly have smaller exponent parameters and larger radii of gyration. TADKB provides a convenient search function: given a query DNA sequence, it outputs the TADs that contain hits of the query sequence. Based on the other TADs within the same family as the hit TAD(s), more annotations may be provided for the query sequence. The role that lncRNAs play in forming chromosome 3D structures is not yet clear. However, lncRNAs have been found to play an important role in assisting or regulating many important DNA functions, although they were originally considered to have no function in the genome. Therefore, the annotations of lncRNAs are also integrated into TADKB.
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R15GM120650 to ZW and start-up funding from the University of Miami to ZW. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the University of Miami.
Availability of data and materials
TADKB can be freely accessed at http://dna.cs.miami.edu/TADKB/.
TL generated most of the knowledge in the website and designed and built the website. TL acquired most of the data, analyzed and interpreted the data, and generated the figures in the manuscript. JP participated in the design and building of the database at the early stage of development. CZ acquired the data related to protein functions, conducted functional analysis of TADs, and wrote the related parts of the manuscript. HZ acquired the single-cell Hi-C data, generated the three-dimensional structures of chromosomes based on single-cell Hi-C data, and wrote the related part of the manuscript. TL and ZW drafted the manuscript. NW, ZS, and YM provided scientific advice and participated in the design of the database. All authors edited the manuscript. ZW conceived and advised the research. All authors have read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
5. Rudan MV, Barrington C, Henderson S, Ernst C, Odom DT, Tanay A, Hadjur S. Comparative Hi-C reveals that CTCF underlies evolution of chromosomal domain architecture. Cell Rep. 2015;10(8):1297–309.
21. Kruskal JB. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika. 1964;29(1):1–27.
22. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(Nov):2579–605.
32. Engreitz JM, Pandya-Jones A, McDonel P, Shishkin A, Sirokman K, Surka C, Kadri S, Xing J, Goren A, Lander ES. The Xist lncRNA exploits three-dimensional genome architecture to spread across the X chromosome. Science. 2013;341(6147):1237973.
33. Zhao Y, Li H, Fang S, Kang Y, Hao Y, Li Z, Bu D, Sun N, Zhang MQ, Chen R. NONCODE 2016: an informative and valuable data source of long non-coding RNAs. Nucleic Acids Res. 2015. https://doi.org/10.1093/nar/gkv1252.
35. Quek XC, Thomson DW, Maag JL, Bartonicek N, Signal B, Clark MB, Gloss BS, Dinger ME. lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res. 2015;43(Database issue):D168–73.
40. Quinodoz SA, Ollikainen N, Tabak B, Palla A, Schmidt JM, Detmar E, Lai MM, Shishkin AA, Bhat P, Takei Y, et al. Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus. Cell. 2018;174(3):744–57.e24.
41. Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal. 2013;33(3):1029–47.
43. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(Oct):2825–30.
44. Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr Sect A: Cryst Phys, Diffr, Theor Gen Crystallogr. 1978;34(5):827–8.
45. Liu T, Wang Z. Measuring the three-dimensional structural properties of topologically associating domains. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2018. p. 21–28.
50. Shi J, Malik J. Normalized cuts and image segmentation. IEEE T Pattern Anal. 2000;22(8):888–905.
56. Park C, Yu N, Choi I, Kim W, Lee S. lncRNAtor: a comprehensive resource for functional investigation of long noncoding RNAs. Bioinformatics. 2014. https://doi.org/10.1093/bioinformatics/btu325.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.