Background

Repbase and conserved noncoding elements

Repbase is now one of the most comprehensive databases of eukaryotic transposable elements and repeats [1]. Repbase started with a set of just 53 reference sequences of repeats found in the human genome [2]. As of July 1, 2017, Repbase contains 1355 human repeat sequences. Excluding 68 microsatellite representatives and 83 representative sequences of multicopy genes (72 for RNA genes and 11 for protein genes), over 1200 human repeat sequences are available.

The long history of research on human repeat sequences resulted in a complicated nomenclature. Jurka [3] reported the first 6 “medium reiterated frequency repeats” (MER) families (MER1 to MER6). MER1, MER3 and MER5 are currently classified as the hAT superfamily of DNA transposons, and MER2 and MER6 are classified as the Mariner superfamily of DNA transposons. In contrast, MER4 was revealed to be comprised of LTRs of endogenous retroviruses (ERVs) [1]. Right now, Repbase keeps MER1 to MER136, some of which are further divided into several subfamilies. Based on sequence and structural similarities to transposable elements (TEs) reported from other organisms, other MER families have also been classified as solo-LTRs of ERVs, non-autonomous DNA transposons, short interspersed elements (SINEs), and even fragments of long interspersed elements (LINEs). Problems in classification also appear with recently reported ancient repeat sequences designated as “Eutr” (eutherian transposon), “EUTREP” (eutherian repeat), “UCON” (ultraconserved element), and “Eulor” (euteleostomi conserved low frequency repeat) [4, 5]. In general, the older the repeat is, the harder it is to classify. One reason for this pattern is the inevitable uncertainty of some ancient, highly fragmented repeats at the time of discovery and characterization.

Recent analyses of repeat sequences have accumulated evidence that repeat sequences contributed to human evolution by becoming functional elements, such as protein-coding regions and binding sites for transcriptional regulators [6, 7]. Due to the rapid amplification of nearly identical copies with the potential to be bound by transcriptional regulators, TEs are proposed to rewire regulatory networks [8,9,10].

Another line of evidence for the contribution of TEs comes from conserved noncoding elements (CNEs), which were characterized via the comparison of orthologous loci from diverse vertebrate genomes. CNEs at different loci sometimes show substantial similarity to one another and to some TEs [11], indicating that at least some of these CNE “families” correspond to ancient families of TEs. Xie et al. [11] reported 96 such CNE families, including those related to MER121, LF-SINE, and AmnSINE1. It was revealed that ancient repeats have been concentrated in regions whose sequences are well conserved [5]. However, resolving the origins of these repeat sequences is a challenge because of their age, divergence and degradation.

This article summarizes our current knowledge about the human repeat sequences that are available in Repbase. The map, showing the positions of repeats in the reference genome, the human genome sequence masked with the human repeat sequences in Repbase, and the copy number and the coverage length of each repeat family are available at http://www.girinst.org/downloads/repeatmaskedgenomes/. It is noteworthy that despite our continuous efforts, most ancient repeat sequences remain unclassified into any group of TEs (Table 1).

Table 1 Ancient repeat sequences not classified yet

Repbase and RepeatMasker

RepeatMasker (http://www.repeatmasker.org/) and Censor [12] are the two most widely used tools for detecting repeat sequences in genomes of interest. These tools use sequence similarity to identify repeat sequences with the use of a prepared repeat library. The repeat library used by RepeatMasker is basically a repacked Repbase that is available at the Genetic Information Research Institute (GIRI) website (http://www.girinst.org/repbase). Censor is provided by GIRI itself and can use the original Repbase. The RepeatMasker edition of Repbase is released irregularly (once a year in the last 5 years), while the original Repbase is updated monthly. However, there are some minor discrepancies between Repbase and the RepeatMasker edition. These differences are caused by independent updates of repeat sequences and their annotations in both databases. These updates are seen especially for human repeats. These discrepancies include different names for the same repeats. For example, MER97B in Repbase is listed as MER97b in the RepeatMasker edition, MER45 in Repbase is found as MER45A in the RepeatMasker edition, and MER61I in Repbase is found as MER61-int in the RepeatMasker edition. In some cases, the corresponding sequences may have less than 90% sequence identity due to independent sequence updates. The MER96B sequences in the two databases are only 89% identical. The consensus sequences of the L1 subfamilies are divided into several pieces (“_5end,” which includes the 5’ UTR and ORF1, “_orf2,” which corresponds to ORF2, and “_3end,” which corresponds to the 3’ UTR) in the RepeatMasker edition to improve the sensitivity of detection.

This article does not aim to eliminate such discrepancies. Instead, some consensus sequences that were found only in the RepeatMasker edition previously were added to Repbase. In this article, all sequence entries are based on Repbase, but if those entries have different names in the RepeatMasker edition, these names are also shown in parentheses in the included Tables.
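The naming discrepancies described above can be handled with a small lookup when comparing annotations from the two libraries. The following is a minimal sketch, not a GIRI or RepeatMasker tool: the three name pairs and the three L1 suffixes are taken from the text, while the function names, the suffix-stripping rule, and the example name "L1PA2_3end" are illustrative assumptions. The identity helper assumes the two consensus sequences have already been aligned with some external program.

```python
# Name pairs quoted in the text (Repbase name -> RepeatMasker-edition name).
REPBASE_TO_REPEATMASKER = {
    "MER97B": "MER97b",
    "MER45": "MER45A",
    "MER61I": "MER61-int",
}

# Suffixes used by the RepeatMasker edition for split L1 consensus sequences.
L1_PIECE_SUFFIXES = ("_5end", "_orf2", "_3end")


def to_repbase_name(repeatmasker_name: str) -> str:
    """Map a RepeatMasker-edition name to a Repbase name where a rule is known.

    Assumption: stripping an L1 piece suffix yields the Repbase subfamily name.
    Names without a known rule are returned unchanged.
    """
    reverse = {rm: rb for rb, rm in REPBASE_TO_REPEATMASKER.items()}
    if repeatmasker_name in reverse:
        return reverse[repeatmasker_name]
    for suffix in L1_PIECE_SUFFIXES:
        if repeatmasker_name.endswith(suffix):
            return repeatmasker_name[: -len(suffix)]
    return repeatmasker_name


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)
```

With such a helper, a pair of entries such as the two MER96B sequences could be flagged when percent_identity falls below a chosen threshold (for example 90), although the exact value depends on how gaps are treated in the alignment.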

TE classification in Repbase

Eukaryotic transposable elements are classified into two classes: Class I and Class II. Class I comprises retrotransposons, which transpose through an RNA intermediate. Class II comprises DNA transposons, which do not use RNA as a transposition intermediate. In other words, Class I includes all transposons that encode reverse transcriptase and their non-autonomous derivatives, while Class II includes all other autonomous transposons that lack reverse transcriptase and their non-autonomous derivatives. Notably, the genomes of prokaryotes (bacteria and archaea) do not contain any retrotransposons.

Repbase currently classifies eukaryotic TEs into three groups: Non-LTR retrotransposons, LTR retrotransposons and DNA transposons [13] (Table 2). Non-LTR retrotransposons and LTR retrotransposons are the members of Class I TEs. To simplify the classification, some newly described groups are placed in these three groups. The “Non-LTR retrotransposons” include canonical non-LTR retrotransposons that encode apurinic-like endonuclease (APE) and/or restriction-like endonuclease (RLE), as well as Penelope-like elements (PLE), which may or may not encode the GIY-YIG nuclease. These non-LTR retrotransposons share a transposition mechanism called “target-primed reverse transcription (TPRT),” in which the 3′ DNA end cleaved by the nuclease is used as a primer for reverse transcription catalyzed by the retrotransposon-encoded reverse transcriptase (RT) [14]. Non-LTR retrotransposons are classified into 32 clades. Short interspersed elements (SINEs) are classified as a group of non-LTR retrotransposons in Repbase. SINEs are composite non-autonomous retrotransposons that depend on autonomous non-LTR retrotransposons for mobilization [15, 16]. SINEs are classified into four groups based on the origins of their 5′ regions [17].

Table 2 TE classification in Repbase

LTR retrotransposons are classified into five superfamilies (Copia, Gypsy, BEL, DIRS and endogenous retrovirus (ERV)), and the ERV superfamily is further subdivided into five groups (ERV1, ERV2, ERV3, ERV4 and endogenous lentivirus). Except for the DIRS retrotransposons, these LTR retrotransposons encode DDE-transposase/integrase for the integration of cDNA, which is synthesized in the cytoplasm by the retrotransposon-encoding RT. The RT encoded by LTR retrotransposons uses tRNA as a primer for reverse transcription. The DDE-transposase/integrase of LTR retrotransposons resembles the DDE-transposase seen in DNA transposons, especially IS3, IS481, Ginger1, Ginger2, and Polinton [18]. DIRS retrotransposons, on the other hand, encode a tyrosine recombinase (YR), which is related to the YRs encoded by Crypton DNA transposons [19].

DNA transposons include very diverse groups of TEs. Repbase currently uses 23 superfamilies for the classification of DNA transposons. Most of these superfamilies encode DDE transposase/integrase [20], but Crypton and Helitron encode the YR and HUH nucleases, respectively [21, 22]. Polinton encodes a DDE transposase that is very closely related to those of the LTR retrotransposons, Ginger1, and Ginger2, but Polinton is an extremely long TE encoding DNA polymerase B and some structural proteins [18, 23]. Polinton was recently reported as an integrated virus designated Polintovirus, based on the identification of the coding regions for the minor and the major capsid proteins [24].
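The classification scheme described in this section can be written down as a small data structure. The sketch below is only an illustration of the hierarchy as stated in the text; it is not a Repbase file format or API, and the class and function names are hypothetical. The 32 non-LTR clades and the 23 DNA transposon superfamilies are not enumerated because the text does not list them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TEGroup:
    name: str
    te_class: str                      # "Class I" (RNA intermediate) or "Class II"
    superfamilies: Tuple[str, ...] = ()


# The three top-level groups used by Repbase, as described in the text.
GROUPS = (
    TEGroup("Non-LTR retrotransposon", "Class I"),   # 32 clades, not listed here
    TEGroup("LTR retrotransposon", "Class I",
            ("Copia", "Gypsy", "BEL", "DIRS", "ERV")),
    TEGroup("DNA transposon", "Class II"),           # 23 superfamilies, not listed here
)

# Subdivision of the ERV superfamily.
ERV_GROUPS = ("ERV1", "ERV2", "ERV3", "ERV4", "Endogenous lentivirus")

# Integration or transposition enzymes explicitly named in the text.
ENZYME = {
    "Copia": "DDE", "Gypsy": "DDE", "BEL": "DDE", "ERV": "DDE",
    "DIRS": "YR",
    "Crypton": "YR",
    "Helitron": "HUH",
    "Polinton": "DDE",
}


def te_class_of(group_name: str) -> str:
    """Return "Class I" or "Class II" for one of the three Repbase groups."""
    for group in GROUPS:
        if group.name == group_name:
            return group.te_class
    raise KeyError(group_name)


def enzyme_of(superfamily: str) -> Optional[str]:
    """Return the enzyme named in the text, or None if the text does not say."""
    return ENZYME.get(superfamily)
```

Returning None for superfamilies that the text does not discuss individually keeps the sketch from asserting more than the section states, even though most DNA transposon superfamilies are reported to encode a DDE transposase.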

Non-LTR retrotransposons

Only three groups of non-LTR retrotransposons are active in the human genome: L1 (long interspersed element-1 (LINE-1)), Alu and SVA (SINE-R/VNTR/Alu). Thanks to their recent activity, these retrotransposons can be classified into many subfamilies based on sequence differences (Table 3). The classification and evolution of these groups is well described in several articles [25,26,27,28]; thus, these three groups are introduced briefly here.

Table 3 Non-LTR retrotransposons (LINEs, SINEs, and composites)

L1 is the only active autonomous non-LTR retrotransposon in the human genome. L1 encodes two proteins called ORF1p and ORF2p. ORF1p is the structural protein, corresponding to Gag proteins in LTR retrotransposons and retroviruses. ORF2p includes domains for endonuclease and reverse transcriptase, as well as a DNA-binding CCHC zinc-finger motif. L1 mobilizes not only its own RNA but also other RNAs that contain 3′ polyA tails. Thus, the presence of L1 corresponds to an abundance of processed pseudogenes, which are also called retrocopies or retropseudogenes [29]. Alu and SVA transpose in a manner dependent on the L1 transposition machinery [15, 30, 31]. L1 is present in most mammals, but some mammals, such as megabats, have lost L1 activity [32].

Based on their age and distribution, L1 lineages are classified as L1P (primate-specific) and L1M (mammalian-wide). These groups are further sub-classified into various subfamilies (Table 3). L1PA1 (L1 and L1HS in Repbase correspond to this subfamily) is the only active L1 subfamily in the human genome. During the evolution of L1, the 5′ and 3′ untranslated regions (UTRs) were replaced by unrelated sequences [27]. These replacements sometimes saved L1 from restriction by KRAB-zinc finger proteins [33].

HAL1 (half L1) is a non-autonomous derivative of L1 and encodes only ORF1p [34]. HAL1s originated independently several times during the evolution of mammals [35].

The majority of Alu is composed of a dimer of 7SL RNA-derived sequences. Dimeric Alu copies in the human genome are classified into three lineages: AluJ, AluS and AluY, among which AluY is the youngest lineage [36]. Older than AluJ are monomeric Alu families, which can be classified into 4 subfamilies: FAM, FLAM-A, FLAM-C and FRAM [37]. FLAM-A is very similar to PB1 from rodents; thus, Repbase does not include FLAM-A. FLAM in Repbase corresponds to FLAM-C. 7SL RNA-derived SINEs are called SINE1. SINE1 has been found only in euarchontoglires (also called supraprimates), which is a mammalian clade that includes primates, tree shrews, flying lemurs, rodents, and lagomorphs [38]. The close similarity between FLAM-A and PB1 indicates their activity in the common ancestor of euarchontoglires, and the lack of SINE1 outside of euarchontoglires indicates that SINE1 evolved in the common ancestor of euarchontoglires after their divergence from laurasiatherians. In rodents, no dimeric Alu has evolved. Instead, B1, which is another type of derivative of PB1, has accumulated. The genomes of tree shrews contain composite SINEs that originated from the fusion of tRNA and 7SL RNA-derived sequences [39].

Several Alu subfamilies are transposition-competent. The two dominant Alu subfamilies that show polymorphic distributions in the human population are AluYa5 and AluYb8. AluYa5 and AluYb8 correspond to approximately one-half and one-quarter of human Alu polymorphic insertions, respectively [40]. AluYa5 and AluYb8 have accumulated 5 and 8 nucleotide substitutions, respectively, from their ancestral AluY, which remains active and occupies ~15% of the polymorphic insertions. Until recently, all active Alu elements were believed to be AluY or its descendants [40]. However, a recent study revealed that some AluS insertions are polymorphic in the human population, indicating that some AluS copies are or were transposition-competent [41]. Monomeric Alu families are older than dimeric Alu families, but monomeric Alu families also show species-specific distributions in the great apes [37]. Monomeric Alu insertions have been generated via two mechanisms. One mechanism is recombination between two polyA tracts to remove the right monomer of dimeric Alu, and the other mechanism is the transposition of a monomeric Alu copy. BC200, which is a domesticated Alu copy [42], is the main contributor to the latter mechanism, but at least one other monomeric Alu copy also contributed to the generation of new monomeric Alu insertions [37].
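As a rough back-of-the-envelope check on the shares quoted above, the approximate fractions can be summed to see what is left for all other Alu subfamilies. The numbers are the rounded values from the text (one-half, one-quarter, ~15%), so the remainder is only indicative and is not a figure reported in the cited studies.

```python
from fractions import Fraction

# Approximate shares of polymorphic Alu insertions quoted in the text.
shares = {
    "AluYa5": Fraction(1, 2),
    "AluYb8": Fraction(1, 4),
    "AluY": Fraction(15, 100),
}

accounted = sum(shares.values())
# Whatever is left would be split among other subfamilies,
# for example other AluY derivatives and the polymorphic AluS insertions.
remainder = 1 - accounted

print(f"accounted for: {float(accounted):.0%}, remainder: {float(remainder):.0%}")
```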

SVA is a composite retrotransposon family, whose mobilization depends on L1 protein activity [30, 31]. Two parts of SVA originated from Alu and HERVK10, which is consistent with the younger age of SVA than Alu and HERVK10 [43]. The other parts of SVA are tandem repeat sequences: (CCCTCT) hexamer repeats at the 5′ terminus and a variable number of tandem repeats (VNTR) composed of copies of a 35–50 bp sequence between the Alu-derived region and the HERVK10-derived region. SVA is found only in humans and apes. Gibbons have three sister lineages of SVA, which are called LAVA (L1-Alu-VNTR-Alu), PVA (PTGR2-VNTR-Alu) and FVA (FRAM-VNTR-Alu) [44, 45]. These three families share the VNTR region and the Alu-derived region but exhibit different compositions.

SVA in hominids (humans and great apes) is classified into 6 lineages (SVA_A to SVA_F), and SVA_F is the youngest lineage [43]. The three youngest subfamilies, SVA_F, SVA_E and SVA_D, contribute to all known polymorphic SVA insertions in the human genome. Recently, another human-specific SVA subfamily was found, and this subfamily has recruited the first exon of the microtubule-associated serine/threonine kinase 2 (MAST2) gene [46,47,48]. The master copy of this human-specific subfamily is presumed to be inserted in an intron of the MAST2 gene and is transcribed in a manner dependent on MAST2 expression in some human individuals, although it is not present in the human reference genome. An SVA_A-related subfamily was recently found in the Northern white-cheeked gibbon (Nomascus leucogenys) and was designated as SVA NLE [45].

In addition to the sequences described above, the human genome contains many signs of the ancient activity of non-LTR retrotransposons belonging to L2, CR1, Crack, RTE, RTEX, R4, Vingi, Tx1 and Penelope (Table 3). With the rapid increase of information about repeats in other vertebrate genomes, TEs from other vertebrates occasionally provide clues about the origin of human repeat sequences. One recently classified example is UCON82, which exhibits similarity to the 3′ tails of vertebrate RTE elements from coelacanth (RTE-2_LCh), crocodilians (RTE-2_Croc) and turtle (RTE-30_CPB) (Fig. 1a). The characterization of L2-3_AMi from the American alligator Alligator mississippiensis revealed the L2 non-LTR retrotransposon-like sequence signatures in UCON49 and UCON86.

Fig. 1

Nucleotide sequence alignments of ancient repeats with characterized TEs. Nucleotides identical to the uppermost sequence are shaded. Numbers in parentheses indicate the nucleotide position in the consensus. a UCON82 is an RTE non-LTR retrotransposon family. b UCON39 is an ancient Mariner DNA transposon family. c Eulor5 and Eulor6 are ancient Crypton DNA transposon families

These groups of non-LTR retrotransposons are also found in several mammals or amniotes, supporting their past activity. L2 is the dominant family of non-LTR retrotransposons in the platypus genome [49]. The diversification of CR1 is a trademark of bird genomes [50]. Active RTE was found in various mammals and reptiles and is represented by Bov-B from bovines [51, 52]. L4 and L5 were originally classified as RTE, but the reanalysis revealed that these sequences are more closely related to RTEX. Non-LTR retrotransposons belonging to the R4 clade were reported in the anolis lizard [53]. Vingi was reported in hedgehogs and reptiles [54]. Some sequence-specific non-LTR retrotransposons belonging to Tx1 are reported in crocodilians [17]. Crack and Penelope have not been reported in any amniotes. On the other hand, R2, which is a non-LTR retrotransposon lineage that is distributed widely among animals [55], is not found in any mammalian genomes.

The human genome also contains many ancient SINE insertions, such as MIRs or DeuSINEs [56,57,58]. It is known that MIRs exhibit sequence similarity to L2 in their 3′ regions, indicating that MIRs were transposed in a manner dependent on the transposition machinery of L2 [49]. MER131 is considered to be a SINE because it ends with a polyA tail. As shown in many reports [6, 59], some of these insertions have been exapted to function as promoters, enhancers or other non-coding functional DNA elements.

LTR retrotransposons

The group of LTR retrotransposons in the human genome is primarily endogenous retroviruses (ERVs) (Table 4). ERV1, ERV2 and ERV3 are all found in the human genome, but the recently recognized ERV4 has not been detected [60]. Neither the endogenous lentivirus nor the endogenous foamy virus (Spumavirus) was found. Some traces of Gypsy LTR retrotransposons have also been found, and this finding is consistent with the domesticated Gypsy (Sushi) sequences in peg10 and related genes [61]. There are no traces of the Copia, BEL or DIRS retrotransposons in the human genome [62], except for the two genes encoding DIRS-derived protein domains: Lamin-associated protein 2 alpha isoform (LAP2alpha) and Zinc finger protein 451 (ZNF451) [63]. BEL and DIRS are found in the anolis lizard genome but have not been detected in bird genomes [62]. Mammalian genomes contain only a small fraction of Gypsy LTR retrotransposons, and it is speculated that during the early stage of mammalian evolution, LTR retrotransposons lost their competition with retroviruses.

Table 4 LTR retrotransposons and endogenous retroviruses

Historically, human ERVs have been designated with “HERV” plus one capital letter, such as K, L or S. Difficulty in classifying ERV sequences is caused by (1) the loss of internal sequences via the recombination of two LTRs and (2) the high level of recombination between different families. Different levels of sequence conservation between LTRs and the internal portions between LTRs increase this complexity. Recently, Vargiu et al. [64] systematically analyzed and classified HERVs into 39 groups. Here, the relationship between the classification reported by Vargiu et al. and the consensus sequences in Repbase is shown (Table 4). Unfortunately, it is impossible to assign all LTRs or internal sequences in Repbase using the classification system reported by Vargiu et al. [64]. Thus, in this review, 22 higher classification ranks in Vargiu et al. [64] are used, and many solo-LTRs are classified as the ERV1, ERV2, ERV3 and Gypsy superfamilies. The numbers of copies for each ERV family in the human genome are available elsewhere, such as dbHERV-REs (http://herv-tfbs.com/), and thus, the abundance or the phylogenetic distribution of each family is not discussed in this review.

ERV1 corresponds to Gammaretroviruses and Epsilonretroviruses. In the classification scheme outlined by Vargiu et al. [64], only HEPSI belongs to Epsilonretrovirus. In addition, one subgroup of HEPSI, HEPSI2, may represent an independent branch from other HEPSIs and may be related to the retrovirus-derived bird gene Ovex1 [65]. Endogenous retroviruses related to Ovex1 were found in crocodilians [60]. Several MER families and LTR families (MER31A, MER31B, MER49, MER65, MER66 (MER66A, MER66B, MER66C, MER66D and MER66_I linked with MER66C), MER87, MER87B, HERV23, LTR23, LTR37A, LTR37B, and LTR39) are reported to be related to MER4 (MER4 group).

ERV2 was classified into 10 subgroups by Vargiu et al. [64]. All of these subgroups belong to the lineage Betaretrovirus. No ERV2 elements closely related to Alpharetrovirus were detected. HERVK is the only lineage of ERVs that has continued to replicate within humans in the past few million years [66], and this lineage exhibits polymorphic insertions in the human population [67].

ERV3 was historically considered to be the endogenous version of Spumavirus (foamy virus); however, the recent identification of true endogenous foamy viruses (SloEFV from sloth, CoeEFV from coelacanth and ERV1-2_DR from zebrafish) revealed that ERV3 and Spumavirus are independent lineages [68, 69]. The ERVL lineage of the ERV3 families encodes a dUTPase domain, while the ERVS lineage lacks dUTPase. The distribution of ERVL- and ERVS-like ERVs in amniotes indicates that at least two lineages of ERV3 have evolved in mammalian genomes [60].

There are many recombinants between different ERV families. HARLEQUIN is a complex recombinant whose structure can be expressed as LTR2-HERVE-MER57I-LTR8-MER4I-HERVI-HERVE-LTR2. HERVE, HERVIP10, and HERV9 are the closest in sequence to HARLEQUIN, indicating that these three ERV1 families are the components that construct HARLEQUIN-type recombinant ERVs. HERVE, HERVIP10 and HERV9 are classified as HERVERI, HERVIPADP and HERVW9, respectively, in Vargiu et al. [64]. Recombinants between different families or lineages make the classification very difficult. The extremes of recombination are the recombinants between two ERVs belonging to ERV1 and ERV3. Such recombination generates ERV3 families that encode an ERV1-like envelope protein, although most mammalian ERV3 families lack envelope protein genes. HERV18 (HERVS) and the related HERVL32 and HERVL66 are such recombinants.
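The structure string given for HARLEQUIN lends itself to a simple decomposition into its component families. The following sketch only parses the string quoted in the text; the function names are hypothetical, and splitting on hyphens is an assumption that holds here because none of the component names contains a hyphen.

```python
from collections import Counter
from typing import List, Optional

# Structure of HARLEQUIN as written in the text.
HARLEQUIN_STRUCTURE = "LTR2-HERVE-MER57I-LTR8-MER4I-HERVI-HERVE-LTR2"


def parse_structure(structure: str) -> List[str]:
    """Split a hyphen-delimited structure string into ordered components."""
    return structure.split("-")


def flanking_ltr(components: List[str]) -> Optional[str]:
    """Return the terminal element name if both ends match, otherwise None."""
    if len(components) >= 2 and components[0] == components[-1]:
        return components[0]
    return None


components = parse_structure(HARLEQUIN_STRUCTURE)
internal = components[1:-1]          # everything between the two terminal LTRs
counts = Counter(components)         # how often each family appears

print("flanking LTR:", flanking_ltr(components))
print("internal segments:", internal)
print("repeated components:", [name for name, n in counts.items() if n > 1])
```

A representation like this makes the mosaic nature of the element explicit: the same terminal LTR appears at both ends, while the internal region is assembled from segments of several different families.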

DNA transposons

As shown by Pace and Feschotte [70], no families of DNA transposons are currently active in the human genome. During the history of human evolution, two superfamilies of DNA transposons, hAT and Mariner, have constituted a large fraction of the human genome (Table 5). Autonomous hAT families are designated as Blackjack, Charlie, Cheshire, MER69C (Arthur) and Zaphod. Many MER families are now classified as non-autonomous hAT transposons. The Mariner DNA transposons that contain at least a portion of a protein coding region are Golem (Tigger3), HsMar, HSTC2, Kanga, Tigger, and Zombi (Tigger4). Some recently characterized repeat sequence families designated with UCON or X_DNA have also been revealed to be non-autonomous members of hAT or Mariner. For example, the alignment with Mariner-N12_Crp from the crocodile Crocodylus porosus revealed that UCON39 is a non-autonomous Mariner family and the first two nucleotides (TA) in the original consensus of UCON39 are actually a TSD (Fig. 1b). The characterization of hAT-15_CPB from the western painted turtle Chrysemys picta bellii led to the classification of Eutr7 and Eutr8 as hAT DNA transposons because those sequences exhibit similarity in the termini of hAT-15_CPB. Based on sequence similarity and age distribution [28], it is revealed that autonomous DNA transposon families have a counterpart: non-autonomous derivative families. MER30, MER30B and MER107 are the derivatives of Charlie12. MER1A and MER1B originated from CHARLIE3. TIGGER7 is responsible for the mobilization of its non-autonomous derivatives, MER44A, MER44B, MER44C and MER44D.

Table 5 DNA transposons

In addition to these two dominant superfamilies, small fractions of human repeats are classified into other DNA transposon superfamilies (Table 5). These repeats are Crypton (Eulor5A, Eulor5B, Eulor6A, Eulor6B, Eulor6C, Eulor6D and Eulor6E), Helitron (Helitron1Nb_Mam and Helitron3Na_Mam), Kolobok (UCON29), Merlin (Merlin1-HS), MuDR (Ricksha), and piggyBac (Looper, MER75 and MER85). A striking sequence similarity was found between Crypton elements from salmon (Crypton-N1_SSa and CryptonA-N2_SSa) and Eulor5A/B and Eulor6A/B/C/D/E, especially at the termini (Fig. 1c). They are the first Eulor families classified into a specific family of TEs and also the first finding of traces of Cryptons in the human genome, except for the 6 genes derived from Cryptons [71].

Like Crypton-derived genes, some human genes exhibit sequence similarity to DNA transposons that have not been characterized as repeats in the human genome. The identification of these “domesticated” genes reveals that some DNA transposons inhabited the human genome in the past. Ancient Transib was likely the origin of the rag1 and rag2 genes that are responsible for V(D)J recombination [72,73,74]. THAP9 has a transposase signature from a P element and retains transposase activity [75]. harbi1 is a domesticated Harbinger gene [76]. rag1, rag2 and harbi1 are conserved in all jawed vertebrates. Gin-1 and gin-2 show similarity to Gypsy LTR retrotransposons, as well as Ginger2 DNA transposons, but are the most similar to some Ginger1 DNA transposons from Hydra magnipapillata [18]. Therefore, although the traces of 4 superfamilies of DNA transposons (Transib, P, Harbinger, and Ginger1) have not been found as repetitive sequences in the human genome, they have contributed to human genome evolution by supplying protein-coding sequences.

Genomic traces of human evolution

Several families of TEs are still active in the human population. L1PA1, SVA and several AluY subfamilies show polymorphism in the human population, indicating their recent activity [40, 77]. Further evidence for the current activity of these TEs comes from the somatic insertions seen in brains and cancer cells [78, 79]. HERVK is the only lineage of ERVs exhibiting polymorphic insertions in the human population [67].

On the other hand, human repeats have accumulated during the whole history of human evolution. These repeats are certainly not restricted to the human genome but are shared with the genomes of many other mammals, amniotes, and vertebrates. Almost all TE families are shared between humans and chimpanzees. An exception is the endogenous retrovirus family PtERV1, which is present in the genomes of chimpanzees and gorillas but not humans [80]. The human TRIM5alpha can prevent infection by PtERV1, and this may be the reason why PtERV1 is absent from the human genome [81]. Sometimes, TE families that ceased transposition long ago in the human lineage have remained active in another lineage. The Crypton superfamily of DNA transposons was active in the common ancestor of jawed vertebrates, judging from the distribution of orthologous Crypton-derived genes [71]. Eulor5A/B and Eulor6A/B/C/D/E are shared among euteleostomi, from mammals to teleost fishes, and show similarity to two non-autonomous Crypton DNA transposons from salmon (Fig. 1c). Copies of Crypton-N1_SSa are over 94% identical to their consensus sequence, and copies of CryptonA-N2_SSa are around 90% identical to their consensus sequence. The autonomous counterparts of these two salmon Crypton DNA transposons may be direct descendants of the ancient Crypton DNA transposon that gave birth to Eulor5A/B and Eulor6A/B/C/D/E. UCON39 is conserved among mammals and shows similarity to the crocodilian DNA transposon family Mariner-N12_Crp (Fig. 1b). The distribution of these two families indicates that they are sister lineages sharing a common ancestor. Copies of Mariner-N12_Crp are only around 82% identical to their consensus. Considering the low substitution rate in the crocodilian lineage, Mariner-N12_Crp also ceased to transpose a very long time ago. These examples clarify the contribution of TEs to the components of the human genome. They also highlight the importance of characterizing TE sequences from non-human animals for understanding human genome evolution.
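The identities to consensus quoted above (over 94%, around 90%, and around 82%) can be turned into a rough divergence estimate. The sketch below uses the Jukes-Cantor (JC69) correction, which is one common choice and is not stated in the text as the method behind these figures; the function name is hypothetical. Converting a divergence into an age would additionally require a lineage-specific substitution rate, which is not given here and is deliberately left out.

```python
import math


def jukes_cantor_distance(identity: float) -> float:
    """Estimate substitutions per site from fractional identity under JC69.

    d = -(3/4) * ln(1 - (4/3) * p), where p = 1 - identity.
    The correction is undefined once p reaches 0.75 (saturation).
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be a fraction between 0 and 1")
    p = 1.0 - identity
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Approximate identities of copies to their consensus, as quoted in the text.
identity_to_consensus = {
    "Crypton-N1_SSa": 0.94,
    "CryptonA-N2_SSa": 0.90,
    "Mariner-N12_Crp": 0.82,
}

for family, identity in identity_to_consensus.items():
    distance = jukes_cantor_distance(identity)
    print(f"{family}: {distance:.3f} substitutions per site")
```

Under this model the corrected divergence grows faster than the raw mismatch fraction, so the gap between the salmon Crypton families and Mariner-N12_Crp is somewhat larger than the raw percentages suggest. Because substitution rates differ between lineages, the same divergence implies a greater age in a slowly evolving lineage such as crocodilians.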

As represented by names such as EUTREP (eutherian repeat) or Eulor (euteleostomi conserved low frequency repeat), different repeat families are shared at different levels of vertebrate groups. Jurka et al. [5] reported 136 human repeat families that are not present in the chicken genome and 130 human repeat sequences that are also present in the chicken genome. These two sets of families likely represent ancient TE families that expanded in the common ancestor of mammals and ancient TE families that expanded in the common ancestor of amniotes, respectively. Based on the carrier subpopulation (CASP) hypothesis we proposed, these TE insertions were fixed by genetic drift after population subdivision [82]. These insertions may have reduced the fitness of the host organism, but they can allow the organism to escape from evolutionary stasis [83]. Once TE insertions were fixed, mutations should have accumulated to increase fitness. Fitness usually increases through the elimination of TE activity and the removal of TE insertions. However, some TE insertions have acquired functions beneficial to the host. Indeed, ancient repeats have been concentrated in regions whose sequences are well conserved [5]. They are expected to have been exapted to have biological functions as enhancers, promoters, or insulators.

More direct evidence for the ancient transposition of TEs is seen in domesticated genes. rag1, rag2, harbi1, and pgbd5 (piggyBac-derived gene 5) are conserved in jawed vertebrates. The most ancient known case of a gene originating from a TE superfamily is the derivation of the woc/zmym genes from a Crypton [71]. Four genes, zmym2, zmym3, zmym4 and qrich1, were duplicated by two rounds of whole genome duplication in the common ancestor of vertebrates and represent the orthologs of woc distributed in bilaterian animals. Unfortunately, this level of conservation is unlikely to be present in non-coding sequences derived from TEs; however, over 6500 sequences are reported to be conserved among chordates, hemichordates and echinoderms [84]. Researchers are more likely to find traces of ancient TEs when analyzing slowly evolving genomes, such as those of crocodilians [85].

Conclusions

Nearly all repeat sequences in the human genome have likely been detected. The current challenge is the characterization of these repeat sequences and their evolutionary history. This characterization is one objective of the continuous expansion of Repbase. Repbase will continue to collect repeat sequences from various eukaryotic genomes, which will help to uncover the evolutionary history of the human genome.