Introduction

Model organisms have been indispensable for understanding the functional roles of cis-regulatory elements (CREs) in complex biological phenomena such as fetal development and pathogenesis. Some human CREs with putative functional roles have been validated by dissecting the mutational impact of their conserved orthologous counterparts in model animals1. However, identifying orthologous CREs, especially for distal enhancers, is computationally challenging due to both their rapid evolution and their degenerate sequence structures. Functional characterization of regulatory elements2 and non-coding GWAS disease-associated variants3,4,5,6 typically begins with mapping human enhancers to mouse by sequence alignment, which suffers from low sensitivity. The accurate identification of orthologous CREs has therefore long been a bottleneck in efforts to improve our understanding of CRE function.

Enhancers are DNA sequences harboring multiple transcription factor binding sites (TFBS) and are important regulators of gene expression7. In spite of their importance, enhancers have evolved rapidly relative to protein-coding sequences8,9 and promoters10,11,12. Since DNA motifs for TFBS are often degenerate and spacing between TFBS typically does not contribute to enhancer function, nucleotide substitutions can accumulate more readily without significant functional changes in enhancers13, and this flexibility likely aids regulatory evolution. Further, duplication of redundant TFBS and accumulation of enhancer mutations can lead to turnover of TFBS through stabilizing selection13. This flexibility also applies on a larger scale to combinations of enhancers within intergenic loci, as enhancer function is weakly constrained by position relative to the target promoter, and enhancers are often accompanied by redundant shadow enhancers that regulate the same target genes14,15,16. As a result, the functional and mechanistic properties of enhancers likely have facilitated their rapid turnover throughout evolutionary history while maintaining DNA binding specificities of transcription factors (TFs)10,17. Our analysis below supports this picture of rapid enhancer evolution in the context of conserved TF-binding specificity across a broad range of cell types.

Previously, several groups have observed that many putative enhancers, marked by chromatin accessibility10, TF binding12, and/or histone modifications11 characteristic of CREs (e.g., H3K27ac, H3K4me1/3), lack functional conservation at orthologous loci of distant mammals predicted by conventional genome alignment (e.g., LASTZ18,19) and mapping algorithms (e.g., LiftOver20). This apparent lack of enhancer conservation is largely due to the rapid evolution of distal enhancers, but limitations in conventional computational genome alignment algorithms to detect conservation can also contribute significantly. Most conventional genome alignment algorithms utilize a seed-and-extend strategy, where short sequence matches are first obtained as seeds and then extended from both ends for further base pair alignment18,19. However, such nucleotide-level modeling of enhancer evolution may not be optimal for resolving sequence structures of enhancers since enhancers are characterized by collections of multiple degenerate TFBS with flexible composition.

This paper addresses two major goals. First, we quantitatively derived 45 pairs of human and mouse cell/tissues from >1000 ENCODE DNase-seq experiments10,21,22, through which we systematically quantified cell/tissue-specific enhancer conservation. Through multiple orthogonal analyses, we show that conservation levels of enhancers depend strongly on the cell/tissue type, which is partly explained by association with transposable elements (TE). Second, we present gkm-align, a novel genome alignment method that incorporates gapped-kmer features to model the sequence degeneracy of enhancers. This feature choice is motivated by sequence modeling using gapped-kmer composition, which has been shown23 to effectively represent biological sequences, accurately predict cell-specific enhancers, and discover regulatory vocabularies associated with TF binding (gkm-SVM)24,25,26 and protein motifs27. The effectiveness of this modeling agrees with the prevailing model that enhancers are defined by clusters of degenerate TFBS24,25,26,28. Gkm-align incorporates this idea and aligns human and mouse sequences by their gapped-kmer composition. Using enhancers of the 45 human/mouse cell/tissue pairs, we systematically evaluated the gkm-align algorithm and discovered thousands of novel conserved enhancers. Further, we show that the discovery rate of conserved enhancers can be increased by incorporating gkm-SVM-derived cell-specific regulatory vocabularies, which we show are conserved between human and mouse.

Results

Enhancer and promoter regulatory vocabularies are conserved, yet enhancers rapidly evolve

Enhancers are distal to transcription start sites (TSS) and harbor binding sites for TFs. The cell-specific expression of these TFs leads to cell-specific enhancer chromatin accessibility and transcriptional regulation. Over the past decades, the ENCODE consortium has generated thousands of DNase-seq experiments10,21,22, across diverse human and mouse cell/tissues, and this comprehensive library of experiments has allowed us to robustly and systematically identify enhancer elements. Chromatin-accessible peaks broadly fall into enhancer and promoter classes, which have different sequence features and conservation properties. To robustly define these classes, we use two genomic features: distance to the nearest TSS, and the cell specificity of DNase I hypersensitivity (Fig. 1A). We quantified the cell specificity of a DNase I hypersensitive site (DHS) as the fraction of all biosamples in the ENCODE database in which the DHS is inaccessible (Nhuman = 1270; Nmouse = 153). To discretize these classes, we classified all DHSs farther than 2 kilobases from the nearest TSS as distal (if not, proximal) and DHSs with cell specificity higher than 0.7 as cell-specific (if not, constitutive). This partitions DHSs into four classes: distal cell-specific, proximal constitutive, distal constitutive, and proximal cell-specific DHSs (section “Methods”).
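The two-threshold partition described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used in the study: the function name `classify_dhs` and its arguments are hypothetical, and only the 2 kb distance cutoff and the 0.7 specificity cutoff are taken from the text.

```python
def classify_dhs(tss_distance_bp, n_accessible, n_biosamples,
                 distal_cutoff_bp=2000, specificity_cutoff=0.7):
    """Assign a DHS to one of the four classes described in the text.

    Cell specificity is the fraction of biosamples in which the DHS is
    inaccessible. Function and argument names are hypothetical.
    """
    specificity = 1.0 - n_accessible / n_biosamples
    location = "distal" if tss_distance_bp > distal_cutoff_bp else "proximal"
    activity = ("cell-specific" if specificity > specificity_cutoff
                else "constitutive")
    return f"{location} {activity}"


# A DHS 50 kb from the nearest TSS, accessible in 100 of 1270 human biosamples
print(classify_dhs(50_000, 100, 1270))
# A DHS 300 bp from the nearest TSS, accessible in 1200 of 1270 biosamples
print(classify_dhs(300, 1200, 1270))
```

Under these definitions the first example falls in the enhancer-like class (distal cell-specific) and the second in the promoter-like class (proximal constitutive).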

Fig. 1: Cell-specific enhancer and promoter gkm-SVM regulatory vocabularies are conserved across mammals, while enhancers rapidly evolve.

A B-lymphocyte DHSs (N = 55,715) distance to nearest TSS and cell-specificity across 1270 biosamples. B Classification of B-lymphocyte DHSs by their distances to nearest TSS (proximal: <2 kb, distal: > 2 kb) and cell-specificity (cell-specific: DHS in <30% of all biosamples, constitutive: otherwise); average epigenetic signals around DHS peak centers by DHS class. C Pairwise comparisons of kmer-weight vectors, derived from enhancers (cell-specific distal DHS), for a pair of similar samples: human and mouse adult brain (Nkmer = 2,097,152; Pearson Corr. = 0.74). D Pairwise comparisons of gkm-SVM kmer-weight vectors, by Pearson correlation, across various human embryonic biosamples (NESC = 11, NThymus = 9, NStomach = 15, NHeart = 19, NKidney = 32, NBrain = 14) trained on enhancers and promoters. E Human–mouse interspecies DHS prediction accuracy (AUROC) across various cell tissues (avg. between mouse element prediction using model trained on human elements and vice versa). F Schematic of enhancer mapping and conservation level of human fibroblast enhancer/promoter regulatory vocabularies (gkm-SVM interspecies enhancer prediction AUROC) and enhancer/promoter CREs (fraction of conserved enhancers mappable by LiftOver) in chimpanzee, gorilla, orangutan, rhesus, and mouse. G Distribution of DNase accessibility in human fibroblast enhancers and promoters (top 10,000 in each DHS class by total DNase-seq reads mapped) and distribution of DNase signal at primate and mouse loci mapped from human enhancers and promoters. Signals are normalized as fold changes from average fibroblast enhancers and promoter accessibilities in respective species (top 10,000 in each DHS class by DNase-seq read mapped). Boxplots represent quartiles and median, while the green dots represent the mean. Source data are provided as a Source Data file.

We observed distinct biochemical signatures in these four element classes. Distal cell-specific DHSs show strong markers of enhancer activity such as ChIP-seq signal for H3K4me1 and lack markers of promoter activity (POLR2A, H3K4me3) (Fig. 1B, Supplementary Figs. 1, 2) (section “Methods”; ENCODE experiment file IDs listed in the  Supplementary Data). By contrast, proximal constitutive DHSs have the highest level of chromatin accessibility among the four classes and display clear signatures of promoter activity (POLR2A, H3K4me3) and depletion of enhancer marks (H3K4me1). These classification criteria allowed us to robustly define enhancer elements without the need for diverse histone ChIP-seq experiments, which are currently unavailable for many biosamples assayed with the DNase-seq experiments. Further, many distal constitutive DHSs appear to be CTCF binding sites (a known regulator of chromatin topological organization), and proximal cell-specific DHSs show mixed signatures of enhancers and promoters, which emphasizes the utility of the two criteria (TSS distance; cell-specificity of chromatin accessibility) for precisely defining enhancer elements (Fig. 1A, B). For the rest of the paper, we will refer to distal cell-specific DHSs as enhancers and proximal constitutive DHSs as promoters.

Enhancer regulatory vocabularies, obtained through gkm-SVM training on enhancers, are cell-specific. Gkm-SVM is a sequence-based machine learning method that effectively distinguishes enhancers from inactive genomic elements by learning the weighted combination of gapped-kmers that predicts enhancers24. Gkm-SVM training assigns higher weights to gapped-kmers enriched in enhancers, and this information can be mapped to kmer weights, where kmers comprised of predictive gapped-kmers—or enhancer vocabularies—are assigned higher weights. The biological relevance of enhancer regulatory vocabularies has been demonstrated by their utility in predicting functional impacts of enhancer sequence variants25,26,28,29,30. All human and mouse gkm-SVM models used in this study are publicly available in the ENCODE portal (encodeproject.org), and their aliases and experimental inputs are listed in the Supplementary Data. Supplementary Fig. 3 describes how to access the gkm-SVM models through the portal.

As enhancer kmer-weight vectors encode DNA binding motifs of core TF regulators, they can be used to quantify how similar enhancer regulatory vocabularies are by cell/tissue and across species. For example, gkm-SVM models trained on human and mouse adult brain enhancers detect the same TFBS, with highly similar enhancer regulatory vocabularies (Fig. 1C; R = 0.74). One of the top predictive kmers (CACCAGATGGC) in both human and mouse brains is a TFBS for neurogenic TFs NEUROG1/2 and ATOH131,32,33. Another predictive kmer for both human and mouse brains (CCATGGCAACC) is bound by RFX family TFs, of which RFX3/RFX5/RFX7 are highly expressed in the human brain34, and their mutations are linked to intellectual and behavioral abnormalities35. On the other hand, kmers associated with TFs highly expressed in non-neural cell/tissues, such as AAAGAGGAAGT (SPI family; blood cell development36) and GGACTTTGACC (HNF4A; liver/pancreas37), have low weights in both human and mouse brain enhancer models. Cell/tissues of distinct identity have low enhancer kmer weight vector similarity (Supplementary Fig. 4; Rbrain vs monocyte, Rspinal cord vs macrophage, Rbrain vs liver = 0.085, 0.11, 0.19). Across a wider range of cell/tissues, biosamples of the same cell/tissue identity consistently show high kmer-weight correlation (mean R = 0.72) while pairs of kmer weights from distinct cell/tissues show lower correlation (mean R = 0.31) (Fig. 1D). By contrast, kmer weights obtained from promoters show low cell-specificity, having high kmer-weight correlation for all pairs of cell/tissues (same tissue mean: 0.80; distinct tissue mean: 0.75). This is consistent with past observations that enhancers are bound by TFs with cell-specific expression while promoter-binding TFs are relatively less cell-specific38.

In addition to weight vector similarity, conservation of enhancer regulatory vocabulary can be demonstrated by showing that models trained on human can predict enhancers in mouse, and vice versa. We trained gkm-SVM enhancer models for a set of human and mouse cell/tissues (brain, retina, B-cell, T-cell, intestine, stomach), and scored enhancers in the other species (i.e., train on human enhancers, predict on mouse enhancers & train on mouse, predict on human, and report the average AUROC) (Fig. 1E, section “Methods”). For matched cell/tissues, the accuracy of interspecies enhancer prediction is high (N = 6; mean AUROC = 0.91), while the prediction accuracy between distinct cell/tissues is low (N = 30; mean AUROC = 0.65). We verified that the high interspecies enhancer prediction accuracies for matched cell/tissues are not inflated through potential train–test set leakage from sequence conservation by showing that two train–test splitting methods, random and syntenic splits (designed to prevent leakage through orthology), show no difference in resulting AUROCs (section “Methods”; Supplementary Fig. 5C). On the other hand, interspecies prediction accuracies for promoters are high for both mismatched (e.g., human brain & mouse B-cell; N = 30; mean AUROC = 0.93) and matched cell/tissues (N = 6; mean AUROC = 0.94). These results are consistent with observations that expression of core TFs39,40 and their DNA binding affinities17 are well conserved between human and mouse, and with previous studies showing that machine learning models trained on human can predict enhancers in mouse41,42.
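The reciprocal evaluation described above (train on one species, score the other, then average the two AUROCs) can be sketched as follows. This is a minimal illustration that assumes model scores are already available for positive (enhancer) and negative sequences; the choice of negative set is not specified here, and the function names are hypothetical. The AUROC is computed directly from its rank definition rather than with any particular library.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5. Quadratic time; fine for a small illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def interspecies_auroc(human_model_on_mouse, mouse_model_on_human):
    """Average the AUROCs of the two reciprocal prediction directions.

    Each argument is a (positive_scores, negative_scores) pair.
    """
    return 0.5 * (auroc(*human_model_on_mouse) + auroc(*mouse_model_on_human))


# Toy scores (hypothetical): one direction separates perfectly, the other not at all
h2m = ([0.9, 0.8], [0.1, 0.2])
m2h = ([0.9, 0.1], [0.5, 0.5])
print(interspecies_auroc(h2m, m2h))
```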

Although enhancer vocabularies are highly conserved, this observation is insufficient to map which specific orthologous enhancers are conserved between the two species, as all enhancers receive similarly high model scores. In addition, the evaluation performed above requires enhancer sets in both species. To clearly demonstrate the challenge of mapping evolutionarily related enhancers, we first used a conventional mapping method (LASTZ/LiftOver) to compute the conservation rate of human fibroblast enhancers and promoters using a set of DHS data in fibroblasts from diverse mammals43 (chimpanzee, gorilla, orangutan, rhesus, and mouse10, in increasing divergence from human) (section “Methods”; Fig. 1F). To assess conservation rate, we computed the fraction of enhancers in one species that map to chromatin-accessible elements in the other species, and averaged the two reciprocal directions. In spite of the conserved regulatory vocabulary, only about 40% of enhancers are conserved between human and chimpanzee, and this value rapidly decreases to 11% between human and mouse as evolutionary distance to human increases (72% decrease; Fig. 1F). The promoter conservation rate also decreases, from 89% between human and chimpanzee to 65% between human and mouse, but at a slower rate (27% decrease). In contrast, regulatory vocabularies of both enhancers and promoters as quantified by gkm-SVM remain nearly constant across all species (Fig. 1F, human–mouse enhancer AUROC: 0.86; promoter: 0.89; human–chimpanzee enhancer AUROC: 0.85; promoter: 0.92). As an alternative metric, we can count the DNase read signal in each species’ orthologous loci relative to average elements in that class, as shown in Fig. 1G for human fibroblast enhancers and promoters, confirming the rapid reduction in enhancer conservation rate. Orthologous chimpanzee loci mapped from human enhancers and promoters are on average 66% and 98% as accessible as average chimpanzee enhancers and promoters, respectively. These signals decrease dramatically to 21% (enhancer) and 86% (promoter) when we map human to mouse, further underscoring the rapid evolution of enhancers.
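As a worked illustration of the arithmetic behind these conservation rates, the sketch below averages the two reciprocal mapping fractions and computes the relative decrease between two species comparisons. The function names and the example counts are hypothetical; the rates passed to `relative_decrease` are the rounded percentages quoted in the text, so the results only approximate the reported decreases.

```python
def conservation_rate(n_a_mapped, n_a_total, n_b_mapped, n_b_total):
    """Average of the two reciprocal mapping fractions (A-to-B and B-to-A)."""
    return 0.5 * (n_a_mapped / n_a_total + n_b_mapped / n_b_total)


def relative_decrease(rate_near, rate_far):
    """Fractional drop in conservation rate from a near to a far species."""
    return (rate_near - rate_far) / rate_near


# Hypothetical counts: 40 of 100 elements map one way, 20 of 100 the other way
print(conservation_rate(40, 100, 20, 100))

# Rounded rates quoted in the text (human-chimpanzee vs human-mouse)
print(relative_decrease(0.40, 0.11))  # enhancers
print(relative_decrease(0.89, 0.65))  # promoters
```

With the rounded inputs, the enhancer decrease comes out near 0.72 to 0.73 and the promoter decrease near 0.27, in line with the values given above.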

We used enhancer regulatory vocabulary to identify the most similar pairs of human and mouse cell/tissues for further quantitative assessment of conservation (Fig. 2). We generated a comprehensive list of 45 human/mouse cell/tissue pairs matched by gkm-SVM enhancer vocabularies (Supplementary Data; section “Methods”) and show that the cell-specific regulatory vocabulary is conserved across all 45 cell/tissues (Supplementary Fig. 6). We will also use this set of 45 pairs of cell/tissues to evaluate our novel genome alignment method, gkm-align, against conventional methods (Figs. 3, 4).

Fig. 2: Enhancer conservation levels are highly variable across cell types.

A Human–mouse enhancer (distal cell-specific DHS) and promoter (proximal constitutive DHS) conservation rates (mappability to DHS by LASTZ/LiftOver genome alignment/mapping algorithms) in orthologous syntenic intergenic loci. Mean of human-to-mouse and mouse-to-human mappings across the 45 human–mouse cell–tissue pairs (A: adult, E: embryonic); sorted by enhancer mappability (identical orders for H, I). B Human–mouse conservation rate for enhancers, promoters, and CTCF ChIP-seq peaks for B-cell, brain, ESC, and heart. C Human–mouse enhancer conservation rate by alignment (mappability by LASTZ/LiftOver) vs human–mouse enhancer regulatory vocabulary conservation rate (correlation of gkm-SVM kmer weights) for pairs of orthologous gkm-SVM matched tissues (e.g., human and mouse brain; N = 45) and mismatched cell/tissues (e.g., human T-cell and mouse brain; N = 1980) (R = 0.78). D Schematics of human–mouse syntenic intergenic loci (HE human enhancer, ME mouse enhancer, HG human gene, MG mouse gene); comparison of human and mouse enhancer numbers in matched syntenic intergenic loci (N = 12,704) for embryonic brain (syntenic enhancer number constraint R = 0.91) and adult liver (R = 0.67). E Syntenic enhancer number constraint vs human–mouse enhancer conservation rate (N = 45, R = 0.78; linear regression line and 95% confidence interval). F Schematic describing how orthologous and paralogous enhancers are defined (HE human enhancer, ME mouse enhancer, G gene; line indicates sequence homology); dots represent 5000 human enhancers w/ highest DNase signal. Gapped-kmer sequence similarity with top-matched mouse enhancers vs with top-matched human enhancers. Enhancers in red-shaded regions are classified as paralogous enhancers; enhancers in blue-shaded regions are classified as orthologous enhancers. G Ratio of orthologous to paralogous enhancers across the 45 cell/tissue pairs (avg. of human and mouse) vs human–mouse enhancer conservation rate (by genome alignment). H Total fraction of enhancer base pairs annotated by each class of transposable elements (avg. of human and mouse). I Fraction of conserved enhancers and overlap with TEs (repeat = enhancers with more than 50% base pair overlap with type I TEs; non-repeat = enhancers with zero overlap with any repeat annotations). Source data are provided as a Source Data file.

Fig. 3: gkm-align algorithm identifies conserved enhancers by finding the optimal alignment path of maximum gapped-kmer similarity.

A Sequence similarity between a mouse (M) enhancer and a human (H) enhancer is quantified by their similarity in gapped-kmer compositions (gapped-kmer similarity, or gkm-sim). B Schematic describing computation of pairwise gkm-similarity of all pairs of sliding windows in syntenic genomic loci of the two species. The pairwise similarity values are encoded in a gapped-kmer similarity matrix G. C Schematic describing how whole-genome alignment is performed using the GNA12 inversion locus as an example (dots: short sequence matches; colors: groups of short matches in syntenic blocks; boxes: pairs of human/mouse syntenic loci from which gkm-similarity matrices derive). Visualization of gapped-kmer similarity matrix (G) in FADS syntenic locus D without gkm-SVM repeat masking and E with masking. F Identification of colinear series of conserved elements using matrix G. G Alignment of the HBB Locus Control Region (dot size: gkm-similarity; color: gkm-SVM prediction score at corresponding human locus using gkm-SVM model trained on mouse embryonic liver enhancers; highlights: CREs); HE: human element; ME: mouse element. H Average DNase accessibility and H3K27ac/H3K4me1/H3K4me3 histone ChIP-seq signals at mouse loci mapped from human enhancers (aggregated across nine distinct cell/tissues) using gkm-align. Signals are normalized as fold change from genomic average. I Number of human–mouse conserved enhancers that are identifiable uniquely by LASTZ/LiftOver (x-axis) and gkm-align (y-axis) for each of the 45 cell/tissue pairs. Gkm-align identifies conserved enhancers missed by LASTZ/LiftOver in all tissues. Relative to LASTZ/LiftOver, gkm-align discovers 197 novel enhancers on average per cell/tissue and total 6591 novel enhancers across all 45 cell/tissues. Source data are provided as a Source Data file.

Fig. 4: gkm-align identifies more novel conserved enhancers and robustly predicts functional conservation when combined with cell-specific information.

A Schematic describing how cell-specific gkm-SVM enhancer prediction model is incorporated into gkm-align for cell-specific weighted alignment. B Number of B-cell human enhancers mappable to mouse B-cell enhancers using LASTZ/LiftOver (gray dashed line) and gkm-align weighted by gkm-SVM enhancer models trained on B-cell (red), myeloid progenitor cell (green), and thymus enhancers (blue) with varying weights. C Percent change in the number of identifiable human/mouse enhancer pairs by gkm-align, relative to LASTZ/LiftOver, across 45 cell–tissue pairs using cell-type matched gkm-SVM model weighting. Boxes show quartiles, with a line at the median; whiskers extend up to 1.5 times the interquartile range. Line plots for a subset of cell/tissues. Dotted line: same but using random syntenic mapping instead of gkm-align. D Number of human–mouse conserved enhancers that are identifiable uniquely by LASTZ/LiftOver (x-axis) and gkm-SVM weighted cell-specific gkm-align (y-axis) for each of the 45 cell/tissue pairs. Relative to LASTZ/LiftOver, weighted gkm-align discovers 610 novel enhancers on average per cell/tissue and total 23,660 novel enhancers across all 45 cell/tissues (using the optimal c for each tissue). E Schematic describing regression model for ranking enhancer mapping to mouse (mDNase) in terms of human enhancer features (qDNase = DNase signal at query human enhancer; gkm-sim = gapped-kmer sequence similarity between human and mouse element; PH(M) = human gkm-SVM score of mapped mouse element). F Predicting DNase-seq signal at mouse loci mapped from human enhancers (mDNase) using combinations of features described in (E) across the 45 human/mouse cell/tissue pairs. Boxes show quartiles, with a line at the median; whiskers extend up to 1.5 times the interquartile range. Source data are provided as a Source Data file.

Enhancer conservation is highly variable across cell types

In the 45 pairs of human and mouse cell/tissues, we observed intriguingly high cell/tissue-specific variability in enhancer conservation. As in Fig. 1F, we defined the conservation rate of human enhancers as the fraction of human enhancers that map to mouse DHS of the matched cell/tissue (e.g., human brain/mouse brain), constraining the mapping by LASTZ/LiftOver to syntenic intergenic loci (section “Methods”). Conservation rates of mouse enhancers were defined similarly, and human–mouse conservation levels were defined as the average of the two reciprocal directions. Promoters showed consistently high conservation rate across the 45 tissues (mean 67%) while enhancers showed a highly variable conservation rate, ranging from 6.7% (embryonic liver) to 31% (embryonic brain) (Fig. 2A). This cell-specific pattern of conservation persists even when we limit our quantification to enhancers with the highest DNase signals (Supplementary Fig. 7). Such lack of conservation is not observed in CTCF ChIP-seq peaks, which are also often distal to TSSs (conservation rate for brain, heart, B-cell, ESC = 82%, 79%, 72%, 76%; Fig. 2B), suggesting that many CTCF loops and topologically associated domains are conserved16,44,45,46. The strong and somewhat counter-intuitive tissue specificity of enhancer conservation will be explored extensively below.

We observed low enhancer conservation rates in some tissues in spite of the fact that their regulatory vocabulary is highly similar between human and mouse. To explore this systematically, we computed the similarity of enhancer regulatory vocabularies (measured as Pearson Corr. of enhancer kmer-weight vectors) for both matched and mismatched pairs of human and mouse cell/tissues (e.g., matched: human brain & mouse brain; mismatched: human brain & mouse muscle; Nmatched = 45; Nmismatched = 45² − 45 = 1980; Fig. 2C). Overall, conservation rates of enhancers and similarity of enhancer regulatory vocabularies correlate highly (R = 0.78, Fig. 2C), indicating that human and mouse cell/tissues that share similar sets of core TF regulators also tend to share a higher number of orthologous enhancers. While enhancer regulatory vocabularies were overall highly similar for all matched human/mouse cell/tissue pairs (relative to the mismatched pairs), interestingly, their enhancer conservation rates varied widely (vertical spread of black points, Fig. 2C). For example, the human/mouse adult liver pair had almost the same similarity of enhancer regulatory vocabulary (R = 0.72) as human/mouse embryonic brains (R = 0.71), but the adult liver enhancer conservation rate was 9.4% (<1/3 of the embryonic brain), even lower than the enhancer conservation rate between a mismatched pair of human embryonic muscle and mouse embryonic brain (12.7%). This implies that some cell/tissues, with conserved core TF regulators and DNA binding specificities, have experienced more frequent enhancer turnover than other cell/tissues.

To rule out the possibility that the highly cell/tissue-specific rate of enhancer conservation reflects a bias of the LASTZ/LiftOver alignment/mapping algorithms, we performed an orthogonal analysis. Using 12,455 syntenic intergenic loci of human and mouse derived from 15,500 orthologous protein-coding genes22 (Supplementary Data), we simply counted the number of human and mouse enhancers located in each of the matched syntenic intergenic loci (Fig. 2D; section “Methods”). We then computed the correlation between the numbers of human and mouse enhancers across the matched syntenic intergenic loci, which we will refer to as the “syntenic enhancer number constraint.” This correlation imposes an upper limit on the mappability of human/mouse enhancers in the matched syntenic loci: if the number of enhancers in syntenic intervals is not conserved, the enhancers cannot be conserved at the sequence level unless they arose through duplication. Embryonic human/mouse brain, which had the highest rate of enhancer conservation, also showed the highest level of syntenic enhancer number constraint (R = 0.93; Fig. 2D), while adult human/mouse liver had lower syntenic enhancer number constraint (R = 0.67), with occasionally drastically different numbers of enhancers in matched syntenic intergenic loci. This is consistent with reports of species-specific rewiring of transcription in the liver47,48 and a relatively slower rate of transcriptomic divergence of the brain across mammals49. The lack of syntenic enhancer number constraint appears in cell/tissues with low enhancer conservation level (Fig. 2E), and hints that the lower rate of conserved enhancers in some cell/tissues, as predicted by genome alignment, is an inherent property of the regulatory landscape.
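The syntenic enhancer number constraint is a correlation between per-locus enhancer counts in the two species, and can be sketched as below. This is a simplified illustration: it assumes each enhancer has already been assigned the index of the syntenic intergenic locus that contains it, and the function names are hypothetical. Pearson correlation is used here to match the R values reported elsewhere in the text; the exact procedure is described in the Methods.

```python
from math import sqrt


def count_per_locus(enhancer_locus_ids, n_loci):
    """Count enhancers falling in each syntenic intergenic locus (0..n_loci-1)."""
    counts = [0] * n_loci
    for locus in enhancer_locus_ids:
        counts[locus] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy example with three matched loci (hypothetical assignments)
human_counts = count_per_locus([0, 0, 1, 2, 2, 2], n_loci=3)
mouse_counts = count_per_locus([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2], n_loci=3)
print(human_counts, mouse_counts, pearson(human_counts, mouse_counts))
```

In this toy case the mouse loci carry exactly twice as many enhancers as the matched human loci, so the counts are perfectly correlated even though the absolute numbers differ.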

This apparent lack of syntenic enhancer number constraint is largely driven by species-specific enhancer duplication, where cell/tissues with lower level of enhancer conservation tend to have a higher proportion of paralogous enhancers (see section “Methods”). To estimate the proportion of orthologous and paralogous enhancers in each human cell/tissue (in the context of human–mouse common ancestry), we labeled a human enhancer as paralogous if it has high sequence homology with another human enhancer but lacks sequence homology with any mouse enhancer, and similarly labeled it as orthologous if it has high sequence homology with a mouse enhancer but lacks homology with any other human enhancers (Fig. 2F; section “Methods”). Based on this criterion, of the 5000 human embryonic brain enhancers with the highest DNase I accessibility, 873 and 100 were identified as orthologous and paralogous enhancers (ratio: 8.73). In contrast, the adult liver had 320 and 905 orthologous and paralogous enhancers (ratio: 0.35). These estimated ratios of orthologous to paralogous enhancers, averaged between human and mouse, closely matched with the syntenic enhancer number constraint and with enhancer conservation rate for the 45 cell/tissue pairs (Fig. 2G; Supplementary Fig. 8), indicating that enhancer duplication events have been a significant contributor to the divergence in enhancer landscapes.
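The labeling rule above can be sketched as a pair of similarity thresholds. The text does not state the cutoffs that define "high" homology or "lacking" homology, so the `high` and `low` values below are placeholders chosen purely for illustration, and the function names are hypothetical. Each enhancer is assumed to come with its best similarity to any enhancer in the other species and its best similarity to any other enhancer in the same species.

```python
def label_enhancer(best_sim_other_species, best_sim_same_species,
                   high=0.5, low=0.3):
    """Label an enhancer from its top cross-species and within-species matches.

    `high` and `low` are placeholder cutoffs, not values from the study.
    """
    if best_sim_other_species >= high and best_sim_same_species < low:
        return "orthologous"
    if best_sim_same_species >= high and best_sim_other_species < low:
        return "paralogous"
    return "unclassified"


def ortholog_paralog_ratio(labels):
    """Ratio of orthologous to paralogous enhancers in a list of labels."""
    return labels.count("orthologous") / labels.count("paralogous")


# Counts reported in the text for the top 5000 human embryonic brain enhancers
brain_labels = (["orthologous"] * 873 + ["paralogous"] * 100
                + ["unclassified"] * 4027)
print(ortholog_paralog_ratio(brain_labels))
```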

These duplication events are largely driven by TEs. The paralogous enhancers show significant enrichment of LTR retrotransposons across diverse cell/tissues (Supplementary Fig. 9; section “Methods”), and we observed a general trend that the cell/tissue pairs with low enhancer conservation level tend to have high enrichment of TE (Fig. 2H, bars in the same order as Fig. 2A; Supplementary Fig. 15A; R = −0.88). Quantifying TE enrichment as the fraction of total enhancer base pairs that overlap with a TE, we observed that embryonic brain enhancers, averaged between human and mouse, had <10% enrichment of type I TE (LTR/LINE/SINE) while some cell/tissues, such as ESC, had TE enrichment as high as 33%. Overall, TE coverage in enhancers appears lower than genome-wide TE coverage (>40%), with SINE elements especially depleted in enhancers of most cell/tissues (Supplementary Fig. 10). Interestingly, however, LTR elements appear to be highly enriched in enhancers across multiple cell/tissues, and their enrichment grows in enhancers with the strongest DNase I accessibility (Supplementary Fig. 11). Although we do observe clear signals of DNase I accessibility in LINE elements (at their 5′ ends) for multiple cell/tissues (Supplementary Fig. 13), these generate weak DNase I peaks, and LINE elements are depleted in enhancers of most cell/tissues (Supplementary Fig. 12). Like LTR, LINE enrichment increases with increasing DNase I accessibility, surpassing the genomic average for subsets of top 1000 enhancers of human colon and ESC with the highest level of DNase I accessibility (Supplementary Fig. 12). By contrast, SINE elements are more depleted in enhancers with higher DNase I accessibility (Supplementary Fig. 10). These observations are consistent with a recent report that also found significantly higher enrichment of LTR than SINE in distal enhancers50. This TE-specific and cell/tissue-specific variation in TE-enhancer association suggests that TEs may have a functional role in shaping the enhancer landscape51,52,53,54,55, but it is difficult to separate function from their naturally increased tendency to transpose into accessible regions.
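The TE enrichment measure used above, the fraction of total enhancer base pairs that overlap a TE annotation, can be sketched with simple interval arithmetic. This is an illustration under stated assumptions: all intervals are half-open `(start, end)` coordinates on a single chromosome, enhancers do not overlap one another, and the function names are hypothetical. TE annotations are merged first so that overlapping annotations are not counted twice.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def te_fraction(enhancers, te_annotations):
    """Fraction of enhancer base pairs covered by at least one TE annotation."""
    tes = merge_intervals(te_annotations)
    total_bp = 0
    covered_bp = 0
    for e_start, e_end in enhancers:
        total_bp += e_end - e_start
        for t_start, t_end in tes:
            # Length of the intersection, or 0 if the intervals are disjoint
            covered_bp += max(0, min(e_end, t_end) - max(e_start, t_start))
    return covered_bp / total_bp


# Toy coordinates (hypothetical): two 300 bp enhancers and three TE annotations
enhancers = [(0, 300), (1000, 1300)]
te_annotations = [(200, 400), (250, 500), (1100, 1150)]
print(te_fraction(enhancers, te_annotations))  # 150 of 600 bp overlap, i.e. 0.25
```

For genome-scale data a sorted sweep or an interval tree would replace the nested loop; the quadratic version is kept here for readability.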

Most enhancers with overlapping TE annotations are species-specific (Fig. 2I), and the increase in TE enrichment in enhancers explains much of the decrease in enhancer conservation across cell/tissues. This is not limited to weaker DHS, and in fact TE enrichment in enhancers grows with DNase I accessibility. We show this by repeating the analysis of Fig. 2I using only the top 1000 enhancers with the highest DNase signal (Supplementary Fig. 16). In summary, enrichment of TE-associated enhancers contributes heavily to the observed cell/tissue-dependent variability in enhancer conservation (orange bars in Fig. 2I and Supplementary Fig. 16 grow with decreasing conservation). However, intriguingly, we also observe a decreasing trend of conservation in non-TE-associated enhancers as TE enrichment in cell/tissues increases (gray bars in Fig. 2I and Supplementary Fig. 16 shrink with decreasing conservation; Supplementary Fig. 15B; R = −0.84). We find it curious that the increased enhancer duplication driven by TEs is correlated with the reduction in the conservation level of non-TE-associated enhancers. It is possible that TE-driven functional redundancy allows more rapid evolutionary turnover in these tissues.

gkm-align algorithm finds the alignment path of maximum gapped-kmer similarity

Identification of conserved enhancers in evolutionarily distant mammals, such as mice, is made difficult by their rapid evolution, and limitations of genome alignment algorithms may underestimate conservation. We do not know how significantly LASTZ/LiftOver alignment inaccuracy contributes to the low rates of conservation shown in Fig. 2. To address this issue, we next present a novel genome alignment algorithm that differs from previous methods by using gapped-kmer sequence features that more readily capture the functional elements of enhancer sequences.

Enhancers contain degenerate clusters of TFBS, and enhancer mutagenesis studies have shown that enhancer function is strongly affected by mutations within binding sites and robust to mutations between binding sites30,56. This modular architecture more readily tolerates insertions/deletions between TFBS and small structural variations. To exploit this modular structure, gkm-align uses a sequence similarity metric that compares a pair of sequences (e.g., width of 300 base pairs) by their gapped-kmer composition (Fig. 3A; Supplementary Notes 1–5 for algorithmic details). Alternative sequence similarity metrics based on kmer composition have previously been used to detect evolutionarily related sequences57,58,59 (typically using 6-mers), but these methods have not been applied to whole-genome alignments. We chose to generate whole-genome alignments using gapped-kmers because they more accurately model flexible combinations of TF-binding sites24,25,28. Gapped-kmers contain a fixed number of gaps, which represent any nucleotide, and compactly model degenerate positions in TFBS. There exist \(\binom{l}{k}4^{k}\) gapped-kmers with size l and k non-gapped positions (e.g., N = 5,406,720 for l, k = 11, 7). Each sequence is represented as a vector of length N that encodes its gapped-kmer counts, and the gapped-kmer similarity (gkm-sim) for a pair of sequences is computed as the cosine similarity of their two count vectors.
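
The feature count and the cosine similarity described above can be illustrated with a minimal Python sketch. It enumerates gapped-kmer features explicitly, which is only practical for short toy sequences; the function names are hypothetical and are not part of the gkm-align package, and details such as reverse-complement handling are omitted here as a simplifying assumption.

```python
from collections import Counter
from itertools import combinations
from math import comb, sqrt


def num_gapped_kmers(l: int, k: int) -> int:
    """Number of distinct gapped-kmers of length l with k non-gapped positions."""
    return comb(l, k) * 4 ** k


def gapped_kmer_counts(seq: str, l: int, k: int) -> Counter:
    """Count gapped-kmers in seq; gap positions are written as 'N'."""
    counts = Counter()
    masks = list(combinations(range(l), k))  # every choice of k kept positions
    for start in range(len(seq) - l + 1):
        lmer = seq[start:start + l]
        for kept in masks:
            feature = "".join(lmer[i] if i in kept else "N" for i in range(l))
            counts[feature] += 1
    return counts


def gkm_sim(seq_a: str, seq_b: str, l: int = 11, k: int = 7) -> float:
    """Cosine similarity between the gapped-kmer count vectors of two sequences."""
    a = gapped_kmer_counts(seq_a, l, k)
    b = gapped_kmer_counts(seq_b, l, k)
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


print(num_gapped_kmers(11, 7))
print(gkm_sim("ACGTACGTTAGCAGATGGCATT", "ACGTACGTTAGCAGTTGGCATT"))
```

The first call reproduces the feature-space size quoted in the text for l, k = 11, 7. The second compares two toy sequences that differ at a single base, so the similarity is high but below 1.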

To align a pair of human/mouse loci, we first compute a gkm-similarity matrix (G) of all pairs of sliding window subsequences of the human and mouse loci (Fig. 3B; Supplementary Note 2). The size of this matrix will depend on the locus size; for example, human/mouse loci of 20 kilobases (e.g., FADS loci) have ~1000 subsequences of 300 base pair windows sliding by 20 base pairs, and the gkm-similarity matrix (G) of dimension 1000 × 1000 encodes all pairwise gkm-similarities of the human and mouse subsequences (as shown in Fig. 3D). Exons of matched orthologous genes in human and mouse show the highest levels of sequence similarity, but regions of low complexity (e.g., tandem repeats) also show high interspecies sequence similarity due to their prevalence and uniform sequence composition (identifiable as a row of horizontal dots in Fig. 3D). To remove these repetitive sequence matches, we train gkm-SVM to detect and mask sequence patterns that are ubiquitous across the human and mouse genomes (Fig. 3E; Supplementary Fig. 14; Supplementary Note 5). We observed that the optimal masking threshold that maximizes the mapping rate from human enhancers to mouse enhancers masks about 10% of the human and mouse genomes (Supplementary Fig. 17). We then compute G using the masked sequences and apply a variant of the Smith-Waterman algorithm to G to identify an optimal alignment path that encodes how human/mouse loci diverged (Fig. 3F; Supplementary Note 3). This method of alignment is extended genome-wide by utilizing orthologous gene annotations22 and short sequence matches19,60 (Fig. 3C; Supplementary Note 4). The gkm-align software package can be downloaded from https://github.com/oh-jinwoo94/gkm-align.
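
As an illustration of the two steps above, the following Python sketch lists sliding-window start positions and runs a generic Smith-Waterman-style dynamic program over a similarity matrix. It is not the published gkm-align variant (see Supplementary Note 3 for that); the `offset` and `gap` parameters and all function names are assumptions chosen only to make the example concrete.

```python
def window_starts(locus_len: int, width: int = 300, step: int = 20) -> list:
    """Start coordinates of all full-width sliding windows in a locus."""
    return list(range(0, locus_len - width + 1, step))


def best_local_path(G, offset: float = 0.5, gap: float = 0.5):
    """Generic local alignment over a window-by-window similarity matrix G.

    A diagonal step through cell (i, j) earns G[i][j] - offset; a step that
    skips a window in one species costs `gap`. Both values are hypothetical.
    Returns the best score and the list of matched (row, column) window pairs.
    """
    n, m = len(G), len(G[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (0.0, None),  # start a new local path here
                (H[i - 1][j - 1] + G[i - 1][j - 1] - offset, "diag"),
                (H[i - 1][j] - gap, "up"),
                (H[i][j - 1] - gap, "left"),
            ]
            H[i][j], move[i][j] = max(options, key=lambda t: t[0])
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    # Trace back from the best-scoring cell to recover matched window pairs.
    path, cell = [], best_cell
    while cell is not None and move[cell[0]][cell[1]] is not None:
        i, j = cell
        if move[i][j] == "diag":
            path.append((i - 1, j - 1))
            cell = (i - 1, j - 1)
        elif move[i][j] == "up":
            cell = (i - 1, j)
        else:
            cell = (i, j - 1)
    return best, path[::-1]


print(len(window_starts(20_000)))
toy_G = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(best_local_path(toy_G))
```

With 300 base pair windows and a 20 base pair step, a 20 kilobase locus yields 986 full windows under this convention, in line with the ~1000 quoted above. On the toy matrix the recovered path follows the diagonal of high-similarity cells.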

We next demonstrate the gkm-align algorithm at the well-studied hemoglobin beta (HBB) locus control region (LCR)61. These loci in human and mouse each contain 4–5 enhancers, and the human enhancers have been shown to be capable of regulating mouse HBB expression through transgenic mouse experiments62. We aligned the HBB LCRs of human and mouse using gkm-align and mapped the five mouse enhancers to human (Fig. 3G). The five mouse enhancers (labeled as ME1, …, ME5) all have strong DNase I accessibility and EP300 binding in mouse embryonic liver (where HBB is active) and are bound by GATA163 in erythroblasts; however, only four of the five human loci mapped from these mouse enhancers (labeled as HE1, …, HE5) show strong marks of enhancers. HE5 has weak GATA1 binding in erythroblasts and low DNase I accessibility in embryonic liver. Further, inhibiting HE5 with CRISPRi has the weakest effect on downregulating HBE1 expression among the five putative human enhancers mapped from mouse64,65 (Supplementary Fig. 18; section “Methods”). HE5 has likely accumulated mutations leading to loss of regulatory activity. This loss of regulatory activity is also predictable by gapped-kmer sequence similarity metrics. As will be justified further below, Supplementary Fig. 18 shows that the geometric mean of gapped-kmer sequence similarity and interspecies gkm-SVM enhancer prediction is consistent with the CRISPRi effect at HBB enhancers.

We applied gkm-align genome-wide across the 45 cell/tissue pairs and identified many novel conserved enhancers that are not predicted by LASTZ/LiftOver (either predicted to be deleted or mapped by LiftOver to inactive regions). Overall, the gkm-align predicted mouse enhancers show clear marks of enhancer activity (Fig. 3H): strong DNase/H3K27ac/H3K4me1 signals and weaker H3K4me3 (averaged across nine cell/tissues for which the mouse histone ChIP-seq data are available; the cell/tissue identities are shown in the Supplementary Data). For all the 45 cell/tissue pairs, gkm-align mapped a higher number of human enhancers to mouse enhancers than LASTZ/LiftOver, with the increase in enhancer mappability ranging from 1% (embryonic limb) to 22% (hematopoietic multipotent progenitor cells) (Figs. 3I and 4C). For the cell/tissue pair of human hippocampus astrocyte and mouse Müller cells (both of which are glial cells), gkm-align successfully mapped 791 human enhancers to mouse enhancers that are incorrectly mapped by LASTZ/LiftOver (either deleted or mapped to inactive mouse regions). These novel conserved enhancers show clear markers of enhancer activity (Supplementary Fig. 19). Conversely, only 222 human enhancers were correctly mapped by LASTZ/LiftOver but incorrectly mapped by gkm-align. In total, 8559 human glial enhancers were mapped to mouse enhancers by both methods. We estimate that only 1.68% of these conserved enhancer mappings arose by chance (Supplementary Fig. 20; section “Methods”). Together across the 45 cell/tissues, gkm-align identified 6591 novel conserved enhancers. This greatly increases the number of human enhancers which can be functionally tested for disease relevance in mouse models.

gkm-align identifies additional novel conserved enhancers when combined with cell-specific vocabulary

Although gkm-align outperforms LASTZ/LiftOver, the sequence similarity metric does not explicitly make use of cell/tissue-specific regulatory vocabulary. Gkm-SVM enhancer regulatory vocabularies, encoding TFBS motifs, are well conserved between human and mouse (Fig. 1E, F), and they can be incorporated into gkm-align both to improve discovery of conserved enhancers and to quantify their predicted functional conservation. This additional information leads to an expanded catalog of human enhancers testable through mouse models, ranked by the likelihood of conserved regulatory function.

We incorporate cell-specific gkm-SVM regulatory vocabularies into gkm-align following a simple and intuitive model: if a pair of human and mouse enhancers, denoted as HE and ME, are orthologous, then they should have similar DNA compositions (i.e., general sequence conservation), and should both contain conserved TFBS motifs relevant to the shared cellular context (i.e., functional sequence conservation) (Fig. 4A). General sequence conservation (G) is quantified using gkm-similarity, as previously described. Functional sequence conservation (F) is computed using interspecies gkm-SVM prediction scores, which we normalize to vary between 0 and 1 for interpretability (Supplementary Note 5). Denoting \({P}_{H}({ME})\) as normalized prediction score of a mouse element by a gkm-SVM model trained on human enhancers, we can interpret \({P}_{H}({ME})\) as the probability that ME can function as an enhancer in the orthologous human cellular context. \({P}_{M}\left({HE}\right)\) is defined similarly (M: mouse; HE: human enhancer). Then, functional sequence conservation, computed as \(F={P}_{H}({ME})\cdot \,{P}_{M}({HE})\), can be interpreted as the probability that ME and HE can both function as enhancers interchangeably in human and mouse cellular contexts. For cell-specific weighted alignment, we combine the two measures of enhancer conservation into \({G}_{c}={G}^{1-c}\cdot {F}^{c}\,(0\le c\le 1)\), which adjusts the alignment path toward human/mouse element pairs with both similar sequence composition (G) and functional similarity in a common cellular context (F). As expected, the resulting alignments increasingly diverge from the cell-independent (c = 0) alignment paths as the degree of cell-specific weighting (\(c\)) increases (Supplementary Fig. 21A), but we also observe that enhancer mappings are largely consistent among different cell-type specific models (Supplementary Fig. 21B).
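
The weighting scheme defined above can be written out in a few lines of Python. This is a minimal sketch of the two formulas only; the function names are hypothetical, and the inputs are assumed to be already normalized to the range 0 to 1 as described in the text.

```python
def functional_conservation(p_h_me: float, p_m_he: float) -> float:
    """F = P_H(ME) * P_M(HE): product of the two interspecies prediction scores."""
    return p_h_me * p_m_he


def weighted_similarity(g: float, f: float, c: float) -> float:
    """G_c = G^(1-c) * F^c for a weighting parameter 0 <= c <= 1."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie between 0 and 1")
    return g ** (1.0 - c) * f ** c


# Toy values, not taken from the paper.
f = functional_conservation(p_h_me=0.9, p_m_he=0.8)
for c in (0.0, 0.5, 1.0):
    print(c, weighted_similarity(g=0.6, f=f, c=c))
```

At c = 0 the score reduces to the cell-independent gkm-similarity G, at c = 1 it reduces to F, and at c = 0.5 it is the geometric mean of the two.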

Cell-specific weighted alignment by gkm-align identifies the highest number of conserved enhancers when it is combined with a gkm-SVM enhancer prediction model trained on enhancers of the relevant cell/tissue type. For example, LASTZ/LiftOver and gkm-align identify 1325 and 1529 conserved human B-cell enhancers, respectively, but if gkm-align is combined with B-cell trained gkm-SVM enhancer prediction models, the number of identifiable conserved enhancers increases up to 1818 at cell-specific enhancer model weighting parameter c = 0.8 (a 37% increase from LASTZ/LiftOver) (Fig. 4B). The identification rate also increases when gkm-align is combined with gkm-SVM models of similar cell types (with overlapping TFs), such as myeloid progenitor cells and thymus, with peaks at 1663 (c = 0.5) and 1605 (c = 0.4) conserved enhancers, respectively. Similarly, for colon enhancers, LASTZ/LiftOver, unweighted gkm-align (c = 0), and cell-specific gkm-align (c = 0.75) identify 1125, 1221, and 1352 conserved enhancers, respectively, which corresponds to a 7.9% and 20% increase over LASTZ/LiftOver for unweighted (c = 0) and weighted (c = 0.75) gkm-align, respectively (Supplementary Fig. 22). As expected, such improvement is smaller when gkm-align is combined with less relevant gkm-SVM models (e.g., trained on proximal DHS; Supplementary Fig. 23). Cell-specific weighting using enhancer-trained gkm-SVM models improves the identification rate of conserved enhancers for all pairs of 45 cell/tissues (Fig. 4C). At c = 0.9, we observe up to an 80% increase in conserved enhancer discovery over LASTZ/LiftOver for monocytes, with 16 cell/tissues with >20% increase for c = 0.7 and c = 0.8. A subset of cell/tissues exhibited limited improvement through gkm-SVM weighting (e.g., brain), but their identification rates remained higher than both unweighted gkm-align and LASTZ/LiftOver at c = 0.5.
Across the 45 cell/tissues, weighted cell-specific gkm-align discovers several hundred novel enhancers in every tissue and 23,660 total novel conserved enhancers across all 45 cell/tissues (Fig. 4D). We confirmed that such improvement using gkm-SVM interspecies enhancer prediction is not a result of train–test set leakage from sequence conservation (Supplementary Fig. 24; section “Methods”; consistent with Supplementary Fig. 5C), and further confirmed that using gapped-kmers (l = 11, k = 7) instead of kmers (l = 6, k = 6) for computing gapped-kmer similarity and for gkm-SVM weighting leads to higher conserved enhancer mapping (Supplementary Fig. 25; consistent with Supplementary Fig. 5D). Despite the increased performance by gkm-align with or without gkm-SVM cell-specific weighting, the cell-specific pattern of enhancer conservation we showed in Fig. 2A is maintained (Supplementary Fig. 26). Lastly, it should be noted that, although cell-specific alignment discovers a higher number of conserved enhancers compared to unweighted gkm-align, there are also a handful of enhancers identifiable by unweighted gkm-align but missed by cell-specific gkm-align (Supplementary Fig. 27). These tend to be conserved enhancers with high general sequence similarities but with more degenerate TFBS. For this reason, both cell-independent and cell-specific gkm-align should be used in parallel to maximize the span of identifiable conserved enhancers, and we provide genome-wide enhancer mappings for both methods.

We reproduced the above results using an independent dataset66, where enhancers in diverse non-human mammals in four tissues (brain, muscle, liver, testis) were identified using H3K27ac and H3K4me1 histone marks. The four tissues are also included among our list of 45 cell/tissues, and we evaluated gkm-align by mapping our enhancers (distal cell-specific DHS) in these tissues to the histone-defined enhancers. Consistent with the above results (Fig. 4B, C), gkm-align maps higher numbers of conserved enhancers than LASTZ/LiftOver across the four tissues, and gkm-align’s performance improves the most when tissue-matched gkm-SVM enhancer models are incorporated (Supplementary Fig. 28). Using the same dataset, we show that gkm-align also maps higher numbers of conserved human enhancers in rhesus macaque for all four tissues (Supplementary Fig. 29). For this analysis, we used gkm-SVM models trained on our mouse enhancer sets to align human and macaque, which highlights the applicability of gkm-SVM models in different species due to conservation of enhancer vocabularies (Fig. 1E, F).

Cell-specific information from gkm-SVM enhancer prediction models can also be combined with the gapped-kmer based sequence similarity metric (gkm-sim) to quantify functional conservation of orthologous enhancer pairs discovered through gkm-align. The strength (DHS signal) of mouse loci mapped from human enhancers tends to correlate with strength of the human enhancers, but a fraction of the identified orthologous pairs lacks such conservation of activity due to sequence divergence (Fig. 1G). To rank predictions, we explored different ways to predict functional conservation between orthologous human and mouse enhancer pairs identified by gkm-align (Fig. 4E). We observed that the DNase signal of a query human sequence (qDNase) was correlated with the DNase signal of the mapped mouse ortholog (mDNase) with median correlation of 0.29 (min: 0.13; max: 0.46), and the product of qDNase and gkm-similarity between query human enhancer (HE) and the mapped mouse ortholog (ME) leads to increased median correlation of 0.41 (min: 0.23; max: 0.60) (Fig. 4F). When this product is further multiplied by enhancer prediction of ME by human-trained gkm-SVM (\(0\le {P}_{H}({ME})\le 1\)), median correlation further improves to 0.46 (min: 0.25; max: 0.61). A mapped mouse ortholog has a high value for this triple product if it exhibits a gapped-kmer composition similar to that of a human enhancer ortholog with high DNase activity and contains conserved TFBS motifs. Training a regression model with these combinatorial features leads to median correlation of 0.55 (min: 0.32; max: 0.68) (Fig. 4F; section “Methods”). All of our human/mouse conserved enhancer mappings are ranked with these gapped-kmer-based conservation scores, which we believe will facilitate downstream experimental testing by providing confidence ranking scores for functional conservation in mice.
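
The product-based scores described above can be sketched as follows. The input values are invented toy numbers, the function names are hypothetical, and Pearson correlation is used here as an assumption about the correlation measure; the regression model mentioned in the text is not reproduced.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def triple_product(q_dnase, gkm_sim, p_h_me):
    """Per-pair score: qDNase x gkm-similarity(HE, ME) x P_H(ME)."""
    return [q * g * p for q, g, p in zip(q_dnase, gkm_sim, p_h_me)]


# Toy data for four orthologous pairs (illustrative only).
q_dnase = [12.0, 3.5, 8.0, 1.2]   # DNase signal of the human query
gkm_sim = [0.40, 0.10, 0.35, 0.05]  # gkm-similarity between HE and ME
p_h_me = [0.9, 0.3, 0.8, 0.2]     # human-trained gkm-SVM score of ME
m_dnase = [10.0, 1.0, 6.5, 0.8]   # observed DNase signal of the mouse ortholog

scores = triple_product(q_dnase, gkm_sim, p_h_me)
print(pearson(q_dnase, m_dnase), pearson(scores, m_dnase))
```

Sorting orthologous pairs by such a score gives a ranking in which pairs with an active human query, high sequence similarity, and a high interspecies prediction score come first.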

Many of the novel enhancers identified by gkm-align are supported by additional evidence of conserved function. For example, gkm-align predicts a conserved enhancer in OTU Deubiquitinase 7A (OTUD7A), which is highly expressed in both human and mouse brains67 (Supplementary Fig. 30) and is associated with a wide range of neurological diseases such as schizophrenia and epilepsy68. OTUD7A knockout leads to morphological deformation of cortical neurons and frequent seizure-like events in mice69,70. We identified orthologous pairs of putative OTUD7A enhancers in intron 1 of human and mouse OTUD7A (Fig. 5A, B; enhancers: yellow-highlighted; hg38/chr15:31740273–31740573; mm10/chr7:63554547–63554847). Both human and mouse elements exhibit strong DNase I accessibility and H3K27ac histone modification across biological samples related to the nervous system (Fig. 5A, B). The two enhancers appear to have three clusters of conserved DNA base pairs with high local conservation in gapped-kmer composition (local conservation rate represented as the logo heights in Fig. 5C), and one of the clusters, located at the centers of the human and mouse enhancers, contains a NEUROG/ATOH1 binding motif (GCAGATGG), which is identified among the top brain enhancer kmer weights for both human and mouse as shown in Fig. 1C. This part of the enhancer has the largest delta-SVM25 score for gkm-SVM models trained on both human and mouse brains (visualized using shaded line plots in Fig. 5C; section “Methods”), indicating that it is a core TF-binding site conserved between human and mouse. Despite the clear conserved biochemical signatures and binding motif, this enhancer is predicted to be deleted in mouse by LASTZ/LiftOver (Supplementary Fig. 31).

Fig. 5: Examples of novel enhancers from the expanded catalog of human/mouse orthologous enhancers.

A Genome browser visualization of mouse and B human OTUD7A loci. Yellow boxes indicate the identified conserved enhancers. C Visualization of conserved binding sites in human OTUD7A intronic enhancer by sequence conservation with the orthologous mouse enhancer. Logo height represents the local gapped-kmer sequence conservation score with mouse, and line plots indicate TF-binding prediction by delta-SVM models trained on human or mouse brain enhancers. D Visualization of DNase accessibility of orthologous enhancers from the five distinct cell/tissues (green: human, blue: mouse; position relative to DHS center of human/mouse orthologous enhancers; signal: fold change from genomic average).

To show further examples of the top conserved enhancers in Fig. 5D, we ranked enhancers that have the strongest combined (i) DNase I accessibility, (ii) gkm-similarity, and (iii) interspecies gkm-SVM prediction, using the regression score described in Fig. 4F. This combined regression score increases the likelihood of functional conservation as shown in Supplementary Fig. 18 for HBB CRISPRi. We ranked enhancers collected from five diverse human and mouse cell/tissues (brain, kidney, stomach, muscle, B-cell), and identified the top 1% of conserved enhancers with the highest regression scores. Among these top orthologous enhancers, we selected a subset of enhancers in the vicinity of orthologous genes with cell/tissue-specific expression67 (Fig. 5D; Supplementary Fig. 30). These genes include KCND3 (brain; voltage-gated potassium channel subunit), PAX2 (kidney; TF associated with renal malformation71), GATA6 (stomach; definitive endoderm TF16), MYOD1 (muscle; TF associated with myopathy72), and IKZF3 (B-cell; TF mutated in leukemia73,74). For each of the 45 cell/tissue pairs, we generated a table of ranked orthologous human-to-mouse enhancer pairs. In addition to providing an expanded catalog of conserved distal enhancer elements, the ranking can be used to prioritize elements for functional characterization. We uploaded our catalog of conserved enhancers to beerlab.org/gkmalign/.

Discussion

Model organism studies have elucidated many of the mechanisms of transcriptional regulation and have functionally validated the roles of enhancers associated with human diseases. However, characterization of enhancers through model animals is possible only for enhancers with identifiable orthologs. Mice are perhaps the most facile model system for human disease, but their evolutionary distance poses more of a challenge than detecting orthologous regulatory sequence in primates (Fig. 1F, G). Past studies using conventional genome alignment and mapping algorithms (LASTZ/LiftOver) have shown that mapping orthologous enhancers with conserved regulatory activities is much more difficult than mapping promoters. We addressed whether the difficulty of enhancer mapping is a result of rapid enhancer evolution or of limitations in conventional genome alignment algorithms. To comprehensively quantify enhancer conservation, we used gkm-SVM to generate unbiased sets of enhancers for each pair of 45 human and mouse cell/tissues matched by core TF regulators (Supplementary Data). Interestingly, enhancers appear to have highly variable levels of conservation across different cell/tissue types. We show that the conservation rate of embryonic brain enhancers is about three times higher than that of the adult liver, although enhancer vocabularies are highly conserved for both tissues (Fig. 2C). The tissue variability in enhancer conservation rate is confirmed in alignment-free analyses (Fig. 2D–G). Part of the explanation of this apparent paradox is that tissues with low enhancer conservation also have more species-specific enhancer duplication. A significant proportion of enhancer duplications appear to have resulted from TE, especially LTRs, and most TE-associated enhancers are species-specific (Fig. 2H, I), partly explaining the relative lack of enhancer conservation in cell/tissues with high TE activity, such as the liver. 
The cell/tissue variability in enhancer conservation aligns with previously reported cell/tissue variable conservation of gene expression, which showed rapid transcriptomic divergence of the liver relative to the brain49. Rapid liver enhancer evolution may explain the apparent transcriptomic differences between human and mouse livers47,48, and may further explain the previously reported transcriptional divergence of many cell/tissue-specific genes75. Intriguingly, in cell/tissues with low enhancer conservation, enhancers with no sequence overlap with repetitive elements also showed reduced conservation (Supplementary Fig. 15B). Our observation is consistent with a model of evolution in which TEs provide an influx of novel TF-binding sites through transposition and further facilitate turnover of nearby enhancers by supplying functional redundancy76. TEs can contribute up to 30% of the strongest DHS peaks in some tissues, but we do not know what proportion of these TFBS-carrying TEs act as transcriptional enhancers to relevant genes, and which might be regulatory noise. Only a small proportion of TE-enhancers have so far been functionally tested55, but we expect to see more functional validation of these elements in the near future due to advances in non-coding CRISPR-based screening methodologies65,77.

We developed gkm-align, a novel gapped-kmer-based alignment algorithm, to detect conserved enhancers. Gkm-align maps orthologous enhancers by finding alignment paths of maximal gapped-kmer similarity at the resolution of sliding ~300 base pair windows. We used a whole-genome alignment strategy and present a new set of conserved enhancer predictions for human and mouse. We evaluated these predictions on 45 pairs of matched tissues using ENCODE data and show that gkm-align detects thousands of conserved enhancers missed by conventional alignment methods (Fig. 3I). We further extend these predictions by incorporating tissue-specific TF information, which predicts an additional 500 enhancers per tissue on average, and up to an 80% increase in some tissues (Fig. 4B–D). While our analysis confirms that mapping orthologous enhancers between distant mammals is an inherently difficult problem due to rapid enhancer evolution, we show that gkm-align detects conserved enhancers of biomedical significance missed by LiftOver/LASTZ, including an intronic enhancer of OTUD7A, a gene that is associated with epilepsy in humans and whose knockout reduces dendritic density and promotes seizures in mice. Many multiple alignment algorithms78,79,80 build upon pairwise alignment outputs from LASTZ, and we expect that the improvement of pairwise genome alignment by gkm-align will lead to improved annotations of conserved CREs in diverse mammalian genomes that are not functionally characterized as deeply as human and mouse. Despite the algorithmic improvement for mapping orthologous enhancers, our analysis confirms the overall weak enhancer conservation relative to promoters, and that enhancers have surprisingly variable conservation rates across cell/tissues. Lastly, we provide an expanded catalog of orthologous human–mouse enhancers, each annotated with predictive gapped-kmer-based functional conservation scores.
We expect that this expanded and quantitatively ranked catalog of conserved enhancers will facilitate discovery and functional characterization by prioritizing enhancers for testing in model animals (Figs. 4E, F and 5).

Methods

Generating enhancer and promoter sets from DNase-seq data and gkm-SVM training

For all the ENCODE DNase-seq data used in this study, filtered alignment bam files were downloaded from the ENCODE portal https://www.encodeproject.org/. For the primate fibroblast DNase-seq data, raw fastq files (GEO accession GSE129034) from ref. 43 were downloaded and mapped to chimpanzee (panTro6), gorilla (gorGor5), orangutan (ponAbe3), and rhesus (rheMac8) genomes using bowtie281 (-L 20). DHS peaks were called by running MACS282 on bams from combined replicates using default parameters but a more stringent p value threshold (-p 1e-9). Full-length 300 bp peaks were generated by extending ±150 bp from MACS2 summits.

Promoter sets were generated by selecting all peaks within 2 kb of any annotated TSS. The remaining distal cell-specific peaks were further filtered by removing all peaks called in more than 30% of ENCODE DHS samples (hg38ubiq30.bed and mm10ubiq30.bed). This definition of promoters and enhancers was used to train the promoter and enhancer gkm-SVM models that are publicly available on the ENCODE portal (Supplementary Data for ENCODE accession IDs). To define primate enhancers, coordinates in hg38ubiq30.bed were mapped to the respective primate genomes by LiftOver. For Figs. 1 and 2, a more stringent definition was used for defining promoter elements to filter out proximal DHSs that are accessible in <30% of DHS samples.
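
The peak definitions above can be illustrated with a short Python sketch. It assumes 0-based half-open coordinates, measures the TSS distance from the peak summit, and treats "within 2 kb" as a distance of at most 2000 bp; these conventions and the function names are assumptions made for the example and are not taken from the original pipeline.

```python
def peak_from_summit(chrom: str, summit: int, half_width: int = 150):
    """Extend a MACS2 summit by +/- half_width to obtain a fixed-width peak."""
    return (chrom, max(0, summit - half_width), summit + half_width)


def classify_peak(chrom, summit, tss_by_chrom, frac_samples_called,
                  tss_window=2000, ubiq_cutoff=0.30):
    """Label a peak as promoter, ubiquitous distal, or cell-specific enhancer.

    tss_by_chrom: dict mapping chromosome name to a list of TSS positions.
    frac_samples_called: fraction of DHS samples in which the peak is called.
    """
    near_tss = any(abs(summit - tss) <= tss_window
                   for tss in tss_by_chrom.get(chrom, []))
    if near_tss:
        return "promoter"
    if frac_samples_called > ubiq_cutoff:
        return "ubiquitous_distal"  # removed from the enhancer set
    return "enhancer"


tss = {"chr1": [10_000, 250_000]}
print(peak_from_summit("chr1", 11_500))
print(classify_peak("chr1", 11_500, tss, 0.80))
print(classify_peak("chr1", 60_000, tss, 0.05))
print(classify_peak("chr1", 60_000, tss, 0.45))
```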

For gkm-SVM model generation, gkm-SVM models were trained on top 10,000 300 bp enhancers or top 10,000 300 bp promoter peaks (with highest DNase-seq MACS2 peak score) vs GC and repeat matched random sequence following refs. 24,25,38 using default parameters (-l 11 -k 7).

To generate gkm-SVM kmer-weight vectors, a fasta file containing all 11-mers was generated using nrkmers.py from the lsgkm software package (v0.1.1)83. Kmer-weight vectors for each of the 1270 human and 153 mouse gkm-SVM enhancer models were generated by running the lsgkm package’s gkmpredict on the 11-mer fasta file with the respective gkm-SVM models.

Visualization of epigenetic signals at DHSs

We used deepTools’s computeMatrix (v3.5.1)84 to compute average epigenetic signals across all DHS for each −1000 to 1000 base pair positions (resolution: 10) relative to the DHS center using the following command:

computeMatrix reference-point -S example.bigWig -R example.bed -a 1000 -b 1000 --referencePoint center -bs 10 -o example.tab.gz

ENCODE ChIP-seq and DNase-seq bigwig files used for this analysis (Fig. 1B and Supplementary Figs. 1, 2) are listed in the Supplementary Data. We normalized signals at each locus by computing the fold change over the genomic average.

Interspecies cis-regulatory element prediction

The top 10,000 elements for each DHS class (enhancer, promoter) in human and mouse were used as the positive set, and randomly sampled 300 bp genomic loci with matched GC-content and repeat annotations were used as the negative set. A gkm-SVM model trained in one species was used to make predictions for elements in the positive and negative sets of the other species. AUROCs shown in Fig. 1E, F were computed from these interspecies prediction values.
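
For reference, an AUROC can be computed from positive-set and negative-set prediction scores as the probability that a randomly chosen positive outscores a randomly chosen negative. The sketch below uses the direct pairwise definition, which is quadratic in the set sizes and intended only as an illustration; the function name and the toy scores are hypothetical.

```python
def auroc(pos_scores, neg_scores) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy interspecies prediction scores (illustrative only).
mouse_enhancer_scores = [1.8, 0.9, 0.4, -0.1]
mouse_background_scores = [0.5, -0.7, -1.2, -0.3]
print(auroc(mouse_enhancer_scores, mouse_background_scores))
```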

Evaluating gkm-SVM enhancer prediction and gkm-SVM weighted gkm-align using random and syntenic train–test set splitting methods

To ensure that our gkm-SVM prediction values are not affected by possible train–test set leakage from sequence conservation, we compared two train–test set splitting methods: random split and syntenic split. The random splitting method randomly partitions human enhancers into five subsets (H1, …, H5) and similarly partitions mouse enhancers into five subsets (M1, …, M5). For computing the accuracy of human-trained gkm-SVM to predict mouse enhancers, we trained on the union set of H1, …, H4 and made predictions on M5 and so on. The syntenic split method uses local gene synteny to define training–test subsets. For each syntenic human intergenic locus (listed in the Supplementary Data), we require that the syntenic mouse interval is in the same subset (e.g., H1 and M1). Thus, we partition human and mouse genomes so that all the mouse genome sequences syntenic to human enhancers in one subset (e.g., H1) are contained in the same matched mouse subset (e.g., M1) (Supplementary Fig. 5A). With this syntenic split strategy, orthologous mouse regions in one subset will never be evaluated using a model trained on the syntenic human intervals. To generate syntenic splits of nearly equal size, we concatenate syntenic human–mouse regions on the same chromosome pairs (e.g., chr2–chr1), rank by total size, and assign to a set in descending order (1,2,3,4,5,1,2,3,4,5…) (Supplementary Fig. 5B; Supplementary Data).
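
The round-robin assignment used for the syntenic split can be sketched as follows. The input format (one tuple per syntenic region giving the human chromosome, the mouse chromosome, and a size in base pairs) and the function name are assumptions made for illustration, as is the alphabetical tie-breaking between chromosome pairs of equal size.

```python
from collections import defaultdict


def syntenic_folds(regions, n_folds: int = 5) -> dict:
    """Assign each human-mouse chromosome pair to a fold in round-robin order.

    regions: iterable of (human_chrom, mouse_chrom, size_bp) tuples.
    Regions sharing a chromosome pair are pooled, pairs are ranked by total
    size in descending order, and folds are assigned as 1, 2, ..., n_folds, 1, ...
    """
    totals = defaultdict(int)
    for human_chrom, mouse_chrom, size_bp in regions:
        totals[(human_chrom, mouse_chrom)] += size_bp
    ranked = sorted(totals, key=lambda pair: (-totals[pair], pair))
    return {pair: rank % n_folds + 1 for rank, pair in enumerate(ranked)}


toy_regions = [
    ("chr2", "chr1", 500), ("chr2", "chr1", 300), ("chr1", "chr4", 700),
    ("chr3", "chr9", 100), ("chr5", "chr11", 90), ("chr6", "chr17", 80),
    ("chr7", "chr5", 70),
]
for pair, fold in syntenic_folds(toy_regions).items():
    print(pair, fold)
```

Because a human region and its syntenic mouse region share the same chromosome pair, both always land in the same fold, which is the property the syntenic split relies on.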

To evaluate gkm-SVM interspecies enhancer prediction, we train gkm-SVM using human sequences in H1–H4 as the positive training set and GC/repeat matched random genomic sequences in H1–H4 as the negative training set. Then, we evaluate gkm-SVM by using mouse sequences in M5 as the positive test set and GC/repeat matched random genomic sequences in M5 as the negative test set, etc. As shown in Supplementary Fig. 5C, prediction accuracies of gkm-SVM using the random and the syntenic train–test split methods show no difference.

For syntenic split gkm-align, we use gkm-SVM trained on H1–H4 and M1–M4 to align syntenic intergenic loci in H5/M5 and so on. For random split gkm-align, we simply train gkm-SVM on randomly selected 80% of human and mouse enhancers to match the training set size in the syntenic split, and combine the resulting model for aligning all syntenic intergenic loci. As shown in Supplementary Fig. 24, the two methods of training and applying gkm-SVM models to gkm-align show no difference in performance.

Generation of 45 human–mouse cell/tissue pairs

For each human sample (N = 1270; Supplementary Data), we identified its best-matching mouse sample (N = 153; Supplementary Data) by finding the mouse sample with the highest gkm-SVM kmer weight Pearson correlation. The best-matching human sample for each mouse sample was identified similarly. The 45 human–mouse cell/tissue pairs were derived as those human and mouse samples that reciprocally mapped to each other by top-matched kmer weights (Supplementary Data).
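
The reciprocal best-match procedure can be illustrated with a small Python sketch. The kmer-weight vectors below are four-element toy lists rather than real 11-mer weight vectors, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def reciprocal_best_pairs(human: dict, mouse: dict) -> list:
    """Return (human_sample, mouse_sample) pairs that are each other's top match.

    human, mouse: dicts mapping sample name to its kmer-weight vector.
    """
    best_mouse = {h: max(mouse, key=lambda m: pearson(hv, mouse[m]))
                  for h, hv in human.items()}
    best_human = {m: max(human, key=lambda h: pearson(human[h], mv))
                  for m, mv in mouse.items()}
    return sorted((h, m) for h, m in best_mouse.items() if best_human[m] == h)


human_models = {
    "h_brain": [1, 2, 3, 4],
    "h_brain2": [1, 2, 3, 5],
    "h_liver": [4, 3, 2, 1],
}
mouse_models = {
    "m_brain": [1, 2, 3, 4.5],
    "m_liver": [4, 3, 1, 1],
}
print(reciprocal_best_pairs(human_models, mouse_models))
```

In this toy example both human brain samples prefer the mouse brain sample, but only the one that the mouse brain sample prefers in return is kept, so each mouse sample ends up in at most one pair.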

Defining orthologous syntenic intergenic loci in human and mouse

The list of all human–mouse orthologous protein-coding gene pairs (N = 15,712) was obtained from the mouse ENCODE consortium publication22 (Supplementary Data). In total, 15,500 of the 15,712 gene pairs had both their human and mouse gene IDs present in the Ensembl database and were retained, and the remaining 212 gene pairs were filtered out (Homo_sapiens.GRCh38.96.chr.gtf, Mus_musculus.GRCm38.96.chr.gtf). Coordinates of these 15,500 conserved human and mouse genes were extracted from the Ensembl gtf files.

To identify all human–mouse syntenic intergenic loci, we first identified all neighboring pairs of human protein-coding genes conserved in mouse. Denote such a human gene pair as HG1 and HG2. If their mouse orthologs, MG1 and MG2, are also neighbors in the mouse genome and the relative transcriptional directions of [HG1 and HG2] and [MG1 and MG2] are preserved (i.e., HG1/HG2 and MG1/MG2 are both tandem, both convergent, or both divergent), we label [HG1/HG2, MG1/MG2] as a human–mouse syntenic neighboring gene pair (N = 12,455). A human syntenic intergenic locus was defined as the union of the genomic space between a pair of human syntenic neighboring genes and their gene bodies, and a mouse syntenic intergenic locus was defined similarly. This led to 12,455 pairs of human and mouse genomic regions that we call “human–mouse syntenic intergenic loci” (Supplementary Data).
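
The neighbor and orientation test described above can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: it assumes the gene lists already contain only conserved protein-coding genes, represents each gene as a hypothetical `(gene_id, chrom, start, strand)` tuple, and classifies orientation from strands read in genomic order, so that an inverted locus keeps the same class.

```python
def orientation(first_strand, second_strand):
    """Classify a neighboring gene pair from strands given in genomic order."""
    if first_strand == second_strand:
        return "tandem"
    return "convergent" if first_strand == "+" else "divergent"

def neighbor_pairs(genes):
    """Map frozenset({gene_a, gene_b}) -> orientation for adjacent genes.

    genes: list of (gene_id, chrom, start, strand) tuples.
    """
    pairs = {}
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    for a, b in zip(ordered, ordered[1:]):
        if a[1] == b[1]:  # only genes on the same chromosome can be neighbors
            pairs[frozenset((a[0], b[0]))] = orientation(a[3], b[3])
    return pairs

def syntenic_pairs(human_genes, mouse_genes, ortholog):
    """Human neighbor pairs whose orthologs are mouse neighbors with the same orientation."""
    mouse_nb = neighbor_pairs(mouse_genes)
    out = []
    for pair, orient in neighbor_pairs(human_genes).items():
        hs = tuple(sorted(pair))
        ms = tuple(ortholog[g] for g in hs)
        if mouse_nb.get(frozenset(ms)) == orient:
            out.append((hs, ms))
    return sorted(out)

# Toy genes (hypothetical coordinates); the mouse locus is inverted.
human_genes = [("HG1", "chr1", 100, "+"), ("HG2", "chr1", 500, "-"),
               ("HG3", "chr1", 900, "+")]
mouse_genes = [("MG2", "chr4", 200, "+"), ("MG1", "chr4", 700, "-"),
               ("MG3", "chr9", 100, "+")]
ortholog = {"HG1": "MG1", "HG2": "MG2", "HG3": "MG3"}
result = syntenic_pairs(human_genes, mouse_genes, ortholog)
```

Here HG1/HG2 and MG1/MG2 are both convergent neighbors and are retained, whereas HG2/HG3 is dropped because MG2 and MG3 lie on different chromosomes.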

Using LASTZ/LiftOver for estimating the conservation rate of cis-regulatory DNA elements

LASTZ (v1.03.66) chain files for human–mouse genome alignment were downloaded from these links (June 2020):

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/vsMm10/hg38.mm10.all.chain.gz

https://hgdownload.soe.ucsc.edu/goldenPath/mm10/vsHg38/mm10.hg38.all.chain.gz

LiftOver software was downloaded from this link (June 2020):

https://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver

The -minMatch=0.01 and -multiple options were used for high sensitivity and to allow duplicate mappings.

Command to map human elements to mouse:

liftOver h_elems.bed hg38.mm10.all.chain.gz h_elems_mapped.bed h_elems_not_mapped.bed -minMatch=0.01 -multiple

Command to map mouse elements to human:

liftOver m_elems.bed mm10.hg38.all.chain.gz m_elems_mapped.bed m_elems_not_mapped.bed -minMatch=0.01 -multiple

With these settings, LASTZ/LiftOver maps a human DHS (e.g., a human brain enhancer) to ≥1 mouse locus, and we say it is conserved if at least one of the mouse-mapped loci overlaps (≥1 bp) with a mouse DHS in the matched cell/tissue within the matched syntenic intergenic locus. The human DHS conservation rate is defined as the fraction of conserved human elements, and the mouse DHS conservation rate is computed similarly. The human–mouse conservation rate is defined as the average of the two rates.
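
As a minimal sketch of the conservation-rate definition, the code below assumes that the LiftOver output has already been parsed into a dictionary from each query element to its list of mapped loci (an empty list for unmapped elements), and that intervals are half-open `(chrom, start, end)` tuples. The restriction to matched syntenic intergenic loci is omitted for brevity, and all coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def conservation_rate(mapped, target_dhs):
    """Fraction of query elements with >=1 mapped locus overlapping a target DHS.

    mapped: dict query_id -> list of mapped (chrom, start, end) loci.
    target_dhs: list of (chrom, start, end) DHS in the matched cell/tissue.
    """
    conserved = sum(
        any(overlaps(locus, dhs) for locus in loci for dhs in target_dhs)
        for loci in mapped.values()
    )
    return conserved / len(mapped)

# Toy data (hypothetical coordinates).
human_to_mouse = {"h1": [("chr1", 100, 200), ("chr2", 50, 80)],
                  "h2": [("chr1", 500, 600)],
                  "h3": [],
                  "h4": [("chr3", 0, 10)]}
mouse_dhs = [("chr2", 79, 120), ("chr1", 600, 700)]
mouse_to_human = {"m1": [("chrX", 10, 20)], "m2": [("chrX", 100, 150)]}
human_dhs = [("chrX", 15, 30), ("chrX", 140, 160)]

human_rate = conservation_rate(human_to_mouse, mouse_dhs)
mouse_rate = conservation_rate(mouse_to_human, human_dhs)
human_mouse_rate = (human_rate + mouse_rate) / 2
```

Only `h1` is conserved in the human-to-mouse direction (one of its two mapped loci overlaps a mouse DHS by 1 bp), while both mouse elements are conserved in the reverse direction.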

Sequence homology analysis for quantifying the proportions of orthologous and paralogous enhancers

Our usage of the terms orthologous and paralogous enhancers is an estimate used to understand these observations, and is not meant to reflect an evolutionary process. Sequence similarity between a pair of enhancers was quantified as the similarity in gapped-kmer composition (gkm-similarity; defined in main Fig. 3A and Supplementary Note 2) using gkmSVM24 R software package (v0.81.0). Prior to computing gkm-similarities, we masked portions of enhancer sequences that are predicted to be highly prevalent genome wide to prevent trivial sequence matches by ubiquitous sequence patterns such as the low-complexity repeats (Supplementary Note 5; Supplementary Fig. 14). About 10% of enhancer sequence base pairs were masked and replaced with random base pairs.

To compute the proportions of human orthologous and paralogous enhancers in each human/mouse cell/tissue pair (e.g., human brain and mouse brain), we obtained the top 5,000 human enhancers with the highest DNase I signals (MACS2 peak score) and computed their pairwise sequence similarity with all other top 5,000 human enhancers and with every mouse enhancer of the matched cell/tissue type. If the mouse enhancer set contained more than 50,000 enhancers, we used only the top 50,000. Using these values, we identified the top-matched human and mouse enhancers that are most similar to each of the top 5,000 human enhancers. Sequence similarities with the top-matched human and mouse enhancers were then used to classify a human enhancer as orthologous, paralogous, or neither. Denoting the top sequence similarity with human enhancers and with mouse enhancers as y and x, respectively, we classified human enhancers according to the following classification rules:

$$\left\{\begin{array}{c}{{{\rm{if}}}\,}x,y < b\hfill {{{{\rm{no}}}}\; {{{\rm{sequence}}}}\; {{{\rm{homolog}}}}}\\ {{{\rm{else}}}}\left\{\begin{array}{cc}{{{\rm{if}}}\,}y > {ax}+b\left(1-a\right) & {{{\bf{paralogous}}}}\\ {{{\rm{if}}}\,}y < \frac{1}{a}x+b\left(1-\frac{1}{a}\right) & {{{\bf{orthologous}}}}\\ {{{{\rm{else}}}}}\hfill & {{{\rm{ambiguous}}}}\end{array}\right.\end{array}\right.$$
(1)

where (a, b) = (4, 0.1). The decision regions for orthologous and paralogous enhancers are shaded blue and red, respectively, in Fig. 2F. Mouse enhancers were classified similarly.
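
The classification rule in Eq. (1) can be written as a short function. This is an illustrative sketch in which the function name `classify` is hypothetical; x and y are the top gkm-similarities with mouse and human enhancers, respectively, and the defaults follow (a, b) = (4, 0.1).

```python
def classify(x, y, a=4.0, b=0.1):
    """Classify an enhancer from its top similarity to mouse (x) and human (y) enhancers."""
    if x < b and y < b:
        return "no sequence homolog"
    if y > a * x + b * (1 - a):
        return "paralogous"
    if y < x / a + b * (1 - 1 / a):
        return "orthologous"
    return "ambiguous"

# Example calls with hypothetical similarity values.
labels = [classify(0.05, 0.05), classify(0.1, 0.5),
          classify(0.5, 0.1), classify(0.5, 0.5)]
```

Both decision boundaries pass through the point (b, b), so the orthologous and paralogous regions are mirror images of each other about the line y = x.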

Annotations for repetitive DNA elements

Annotations for repetitive elements were downloaded from the following links85:

http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/rmsk.txt.gz

http://hgdownload.cse.ucsc.edu/goldenpath/mm10/database/rmsk.txt.gz

Quantifying enhancer strength of human loci mapped from mouse HBB enhancers using CRISPRi perturbation data

HCR-FlowFISH CRISPRi data (K562 cell line; HBE1 expression perturbation as readout) were downloaded from the ENCODE portal (accession ID provided in the Supplementary Data)64,65. Two biological replicates were used, where each replicate provides sgRNA sequence read counts for the low and high expression sort bins. The following equation was used to compute the CRISPRi effect size of each sgRNA:

$$\,{\log }_{2}F{C}_{i}={\log }_{2}\,\left(\frac{1+\left(\frac{{L}_{i}}{{{{\rm{mean}}}}\left(L\right)}\,\right)}{1+\left(\frac{{H}_{i}}{{{{\rm{mean}}}}\left(H\right)}\right)\,}\right)\,$$
(2)

where L and H are vectors encoding the number of reads for each sgRNA in the low and high sort bins, respectively. Normalization with the mean underweights sgRNAs with low read counts65. The enhancer strength of each putative human enhancer, mapped from mouse HBB enhancers using gkm-align, was computed as the average log2FC of the sgRNA targets that overlap with the enhancer. Mouse enhancer coordinates were defined using mouse embryonic liver DHS (ENCFF578VRG; extended from the summit to a width of 300 base pairs) that also overlap with GATA1 ChIP-seq peaks in mouse erythroblast (ENCFF676TDJ) within mm10/chr7:103851395–103883181.
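
Eq. (2) and the per-enhancer averaging can be sketched as below for a single replicate. How the two biological replicates are combined is not specified here, so the sketch leaves that step out; the read counts, sgRNA coordinates, and function names are hypothetical.

```python
from math import log2

def crispri_log2fc(low, high):
    """Per-sgRNA effect size from read counts in the low and high sort bins (Eq. 2)."""
    mean_low = sum(low) / len(low)
    mean_high = sum(high) / len(high)
    return [log2((1 + l / mean_low) / (1 + h / mean_high))
            for l, h in zip(low, high)]

def enhancer_strength(log2fc, sgrna_pos, enhancer):
    """Average log2FC of sgRNA targets overlapping a (start, end) enhancer interval."""
    start, end = enhancer
    vals = [fc for fc, (s, e) in zip(log2fc, sgrna_pos) if s < end and start < e]
    return sum(vals) / len(vals) if vals else None

# Toy read counts for three sgRNAs (hypothetical values).
low = [30, 10, 20]
high = [10, 30, 20]
fc = crispri_log2fc(low, high)

# Toy sgRNA target intervals and one enhancer interval on the same chromosome.
sgrna_pos = [(10, 30), (100, 120), (40, 60)]
strength = enhancer_strength(fc, sgrna_pos, (0, 50))
```

An sgRNA enriched in the low-expression bin receives a positive log2FC, so a higher average value over an enhancer indicates a stronger reduction in expression upon perturbation.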

Estimating the probability of mapping conserved enhancers by chance

We compared gkm-align against random mapping in syntenic intervals, so that both gkm-align and the random mapping map human enhancers to the same amount of mouse genomic space. Using the list of syntenic intergenic loci (Supplementary Data), we mapped all human enhancers contained in human/mouse syntenic intergenic loci to random positions within the matched mouse syntenic locus. We repeated this process 100 times for each of the 45 cell/tissue pairs (Supplementary Fig. 20). The probability of mapping conserved enhancers by chance was estimated by computing the average mapping rate of the random control across the 45 cell/tissue pairs.
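
The random control can be sketched for a single syntenic locus as follows. This is a simplified illustration under stated assumptions: positions are drawn uniformly within one mouse locus of length `locus_len`, a mapping counts as conserved if the random interval overlaps any mouse DHS by at least 1 bp, and the function name, argument names, and values are hypothetical.

```python
import random

def random_mapping_rate(enhancer_widths, locus_len, mouse_dhs, n_rep=100, seed=0):
    """Average fraction of enhancers that land on a mouse DHS under random placement.

    enhancer_widths: widths (bp) of the human enhancers in the locus.
    mouse_dhs: list of half-open (start, end) intervals in locus coordinates.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_rep):
        hits = 0
        for w in enhancer_widths:
            start = rng.randint(0, locus_len - w)  # uniform random placement
            end = start + w
            hits += any(start < e and s < end for s, e in mouse_dhs)
        rates.append(hits / len(enhancer_widths))
    return sum(rates) / n_rep

# Toy locus of 100 kb with ten 300-bp enhancers and two mouse DHS.
widths = [300] * 10
rate = random_mapping_rate(widths, 100_000, [(20_000, 20_300), (75_000, 75_300)])
```

Averaging such rates over all loci and cell/tissue pairs would give a chance-level estimate of the kind described above.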

Regression model for predicting functional conservation

For predicting functional conservation of human enhancers in mouse (measured as DNase signal at mouse loci mapped from human enhancers), we denote a set of human enhancers as \(\{H{E}_{1},H{E}_{2},\ldots \}\) and denote the mouse element mapped from query \(H{E}_{i}\) by gkm-align as \(M{E}_{i}\). Denoting DNase signal at query \({HE}\) as \({q}_{{sig}}\) and signal at mapped \({ME}\) as \({m}_{{sig}}\) (fold change relative to genomic average), we model functional conservation as:

$${m}_{{sig}}=f\left({q}_{{sig}},g,p\right)={\sum}_{{i}_{1}\in \{0,1\}}{\sum}_{{i}_{2}\in \{0,1\}}{\sum}_{{i}_{3}\in \{0,1\}}{\alpha }_{{i}_{1},{i}_{2},{i}_{3}}\cdot \left({q}_{{sig}}^{{i}_{1}}\cdot {g}^{{i}_{2}}\cdot {p}^{{i}_{3}}\right),$$
(3)

where \(g={gkmsim}(H{E}_{i},{M}{E}_{i})\), \(p={P}_{H}(M{E}_{i})\), and f linearly combines all multiplicative combinations of the three variables (Fig. 4E). The predicted mouse regulatory signal \({\hat{m}}_{{sig}}\) is computed using linear regression with fivefold cross-validation (CV). The product \({q}_{{sig}}\cdot g\cdot p\), which is also highly predictive (Fig. 4F), may be used instead if \({m}_{{sig}}\) is not available (e.g., no relevant functional experiments performed in mice).
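
The structure of Eq. (3) can be made concrete with a short sketch that expands the three variables into the eight interaction terms and evaluates f for a given coefficient vector. The fivefold CV fit of the coefficients is not shown; the function names, coefficient ordering, and input values are assumptions made for illustration.

```python
from itertools import product

def interaction_features(q, g, p):
    """All 8 products q^i1 * g^i2 * p^i3 with each exponent in {0, 1}."""
    return [(q ** i1) * (g ** i2) * (p ** i3)
            for i1, i2, i3 in product((0, 1), repeat=3)]

def predict(alpha, q, g, p):
    """Evaluate f(q, g, p) for coefficients alpha ordered as in interaction_features."""
    return sum(a * f for a, f in zip(alpha, interaction_features(q, g, p)))

# Hypothetical inputs: query signal q, gkm-similarity g, and P_H score p.
feats = interaction_features(2, 0.5, 0.25)

# Keeping only the coefficient of the full product recovers q * g * p.
product_only = predict([0, 0, 0, 0, 0, 0, 0, 1], 2, 0.5, 0.25)
```

The features are ordered as 1, p, g, g·p, q, q·p, q·g, q·g·p, so setting all coefficients except the last to zero reduces the model to the product \({q}_{{sig}}\cdot g\cdot p\).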

Visualizing conserved TF-binding sites

In Fig. 5C, we visualized the orthologous OTUD7A enhancers using delta-SVM25 (binding site prediction; shaded line plot) and gkm-align (conservation; logo height) using the “visualize_mappings” software in the gkm-align software package. Delta-SVM scores for each base pair in the human OTUD7A enhancer (chr15:31740273–31740573) were computed using gkm-SVM models trained on human brain enhancers (gkm-SVM alias: DHS_518_hg38) and mouse brain enhancers (gkm-SVM alias: DHS_123_mm10). Further details of these enhancer models are given in the Supplementary Data. The delta-SVM scores were computed as follows. To compute the delta-SVM score of a base pair within an enhancer, we first computed the gkm-SVM prediction score for the original enhancer sequence by adding up the kmer weights of all kmers contained in the enhancer (\({Y}_{{\rm{initial}}}\)). We then computed gkm-SVM prediction scores for the enhancer with a single nucleotide variation (SNV) at the base pair of interest. There are three possible SNVs and therefore three possible altered gkm-SVM scores (\({Y}_{{\rm{final}}_{1}},{Y}_{{\rm{final}}_{2}},{Y}_{{\rm{final}}_{3}}\)). For visualization, we computed the average delta-SVM score (\(\frac{1}{3}{\sum }_{i=1}^{3}{Y}_{{\rm{final}}_{i}}-{Y}_{{\rm{initial}}}\)).
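
The average delta-SVM computation can be illustrated with a toy scoring function. As a simplification, this sketch scores a sequence by summing weights of ungapped kmers of a fixed length k, whereas gkm-SVM uses gapped kmers; the weights, sequence, and function names are hypothetical.

```python
def score(seq, weights, k):
    """Sum of kmer weights over all length-k windows of seq (missing kmers weigh 0)."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))

def mean_delta_svm(seq, pos, weights, k):
    """Average score change over the three possible SNVs at 0-based position pos."""
    y_initial = score(seq, weights, k)
    alts = [b for b in "ACGT" if b != seq[pos]]
    y_final = [score(seq[:pos] + b + seq[pos + 1:], weights, k) for b in alts]
    return sum(y_final) / 3 - y_initial

# Toy weights in which a GATA-like pattern contributes positively (hypothetical).
weights = {"GAT": 3.0, "ATA": 1.5}
seq = "CGATAC"
inside_site = mean_delta_svm(seq, 2, weights, 3)   # base within the weighted kmers
outside_site = mean_delta_svm(seq, 5, weights, 3)  # base outside the weighted kmers
```

Bases inside a predicted binding site yield strongly negative averages because every substitution removes positively weighted kmers, while bases outside the site yield values near zero.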

We visualized the local conservation of gapped-kmer composition at each base pair of the conserved human enhancer by estimating the contribution of each base pair to the overall gapped-kmer similarity between the human and mouse enhancer. We first repeat-masked human and mouse base pairs to filter out base pairs that are overrepresented in the human and mouse genomes (as done in Fig. 3E; Supplementary Note 5; Supplementary Fig. 14). Then, we ran gkm-align at the level of a single enhancer (gkm-align parameter setting: w = 11, s = 1, l = 11, k = 7), as opposed to running gkm-align at the level of a syntenic intergenic locus for conserved enhancer mapping (w = 300, s = 20, l = 11, k = 7; e.g., Fig. 3E). For enhancers with width 300 bp, this generates a gapped-kmer matrix \(G\) (dimension: \(290\times 290\)). The matrix values were then mapped to each base pair in the human enhancer by summing across the matrix columns (\(v\in {{\mathbb{N}}}^{290},\,v\left[i\right]={\sum }_{j=1}^{290}G\,[i,\,j]\)). Each value in \(v\) corresponds to a kmer in the human enhancer, and the gapped-kmer contribution from each base pair was estimated by averaging the values of \(v\) that correspond to kmers that overlap with the base pair (i.e., the contribution for the ith base pair is \(c\left[i\right]=\frac{1}{l}{\sum }_{m=1}^{l}v\,[i-l+m]=\frac{1}{l}{\sum }_{m=1}^{l}{\sum }_{j=1}^{290}G\,[i-l+m,\,j]\)). For clearer visualization of the conserved OTUD7A enhancer, we transformed matrix \(G\) so that matrix elements of \(G\) outside its optimal alignment path (\(P\)) are zero (i.e., \(\widetilde{G}\,[i,\,j]=\left\{\begin{array}{ll}G\,[i,\,j] & {\rm{if}}\,(i,\,j)\in P\\ 0 & {\rm{if}}\,(i,\,j)\notin P\end{array}\right.\)). The logo heights in Fig. 5C are computed as \(\widetilde{c}\left[i\right]=\frac{1}{l}{\sum }_{m=1}^{l}{\sum }_{j=1}^{290}\widetilde{G}\,[i-l+m,\,j]\).
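
The per-base contribution can be sketched with a small matrix. The code below uses 0-based indices, so the kmers overlapping base i start at positions i − l + 1 through i. Kmer start positions that fall outside the matrix near the enhancer edges are simply skipped while still dividing by l, which is an assumption of this sketch; the function name and toy values are hypothetical.

```python
def base_contributions(G, l, path=None):
    """Per-base contribution to gapped-kmer similarity.

    G: square list-of-lists matrix indexed by (human kmer, mouse kmer).
    l: kmer length; path: optional set of (i, j) cells on the alignment path.
    """
    n = len(G)
    if path is not None:
        # Zero out all matrix elements outside the alignment path.
        G = [[G[i][j] if (i, j) in path else 0.0 for j in range(len(G[i]))]
             for i in range(n)]
    v = [sum(row) for row in G]  # one value per human kmer
    n_bases = n + l - 1
    c = []
    for i in range(n_bases):
        # Kmers overlapping base i start at i-l+1 .. i, clipped to valid indices.
        starts = range(max(0, i - l + 1), min(n - 1, i) + 1)
        c.append(sum(v[s] for s in starts) / l)
    return c

# Toy 3x3 matrix with kmer length 2, i.e., a 4-bp sequence (hypothetical values).
G = [[1, 0, 0],
     [0, 2, 0],
     [1, 0, 3]]
c_full = base_contributions(G, 2)
c_path = base_contributions(G, 2, path={(0, 0), (1, 1), (2, 2)})
```

Restricting the sum to the alignment path removes the off-diagonal entry at (2, 0), which lowers the contributions of the last two bases.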

Additional algorithmic details of gkm-align are provided in the Supplementary Notes in the Supplementary Information file.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.