New Insights on the Evolution of Genome Content: Population Dynamics of Transposable Elements in Flies and Humans

Understanding the abundance, diversity, and distribution of TEs in genomes is crucial for understanding genome structure, function, and evolution. Advances in whole-genome sequencing techniques, as well as in bioinformatics tools, have increased our ability to detect and analyze the transposable element content in genomes. In addition to reference genomes, we now have access to population datasets in which multiple individuals within a species are sequenced. In this chapter, we highlight the recent advances in the study of TE population dynamics focusing on fruit flies and humans, which represent two extremes in terms of TE abundance, diversity, and activity. We review the most recent methodological approaches applied to the study of TE dynamics as well as the new knowledge on host factors involved in the regulation of TE activity. In addition to transposition rates, we also focus on TE deletion rates and on the selective forces that affect the dynamics of TEs in genomes.


Transposable Elements Are Abundant and Active Genome Denizens
Transposable elements (TEs) are short DNA sequences, typically from a few hundred bp to ~10 kb long, which have the ability to move around in the genome by generating new copies of themselves. In addition to active autonomous elements, genomes also contain nonautonomous elements that can be mobilized by the enzymatic machinery of active TEs from the same family. Additionally, genomes contain TEs that cannot be mobilized anymore due to accumulation of mutations in their sequences [1]. TEs are an ancient, extremely diverse, and exceptionally active component of genomes. TEs have been found in virtually all organisms studied so far including bacteria, archaea, fungi, protists, plants, and animals [2][3][4][5]. The main TE groups, class I and class II, are present in all kingdoms, revealing their persistence over evolutionary time [2]. These two classes of TEs differ in their transposition intermediates: while class I TEs transpose through RNA intermediates, class II TEs transpose directly as DNA. TEs within each class are further classified into (1) different orders, based on their insertion mechanism, structure, and encoded proteins; (2) different superfamilies, based on their replication strategy and on presence and size of target site duplications; and (3) different families, based on sequence conservation [2,3]. Piegu et al. [1] criticized the current classification system, which accounts for sequence homology, structural features, and target site duplications, because it does not always take into account the evolutionary origins of the TEs [1][2][3]. As a consequence, phylogenetically unrelated classes or subclasses of TEs are grouped together [1]. Piegu et al. [1] also suggested that a more inclusive classification that includes prokaryotic and eukaryotic TE classes should be considered.
Recently, Arkhipova [6] proposed a TE classification system based on the replicative, integrative, and structural components of TEs, which integrates different aspects of all the existing classification systems [6].
TEs constitute a substantial albeit variable (from ~1% to almost 90%) proportion of genomes [7,8] (Fig. 1). The identification methods, as well as the sequencing and assembly methods, have an important effect on the TE content estimation [4][9][10][11]. In some cases, the TE-generated fraction of genomes is likely to be underestimated because methods for detecting TEs in genomic sequences are necessarily biased toward younger and more easily recognizable TEs. Indeed, new tools developed in recent years are able to identify TEs that remained hidden until now [4,11]. As an example, when the human genome was first sequenced, ~40-45% of the genome was identifiable TEs, 5% was genes and other functional sequences (functional RNAs or regulatory regions), and the remaining ~50% of the genome had no identifiable origin [12]. de Koning et al. [13], using a highly sensitive new strategy named P-cloud, found that at least 66-69% of the human genome is identifiable as repetitive sequences, most of them derived from TEs [13]. In Drosophila melanogaster, third-generation sequencing techniques (3GS) have allowed the detection of 37% more TE insertions in chromosome 2L compared to previously available short-read sequencing estimates (see below) [14]. In other Drosophila species such as D. buzzatii, the TE content has also been updated from 6% to 11%, thanks to the recent availability of whole-genome sequences [15].
Overall, recent advances in sequencing technologies and in TE detection methods showed that, as expected, the TE content is higher than previously estimated. These new data also provided further evidence for the impact of TEs in genome function and genome structure. Thus, it is still indisputable that a thorough understanding of TE population dynamics is essential for the understanding of the eukaryotic genome structure, function, and evolution.
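As a minimal sketch of how a TE content estimate follows from an annotation, the Python snippet below merges overlapping annotated intervals and divides the covered length by the genome length. The coordinates and genome length are toy values, and the function names are ours rather than those of any published tool. The point is only that the estimate depends directly on which intervals the detection method manages to annotate.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals given as (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of double counting.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def te_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one annotated TE interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Toy annotation (hypothetical coordinates on a 10 kb sequence).
toy_annotation = [(100, 400), (350, 600), (900, 1000)]
print(te_fraction(toy_annotation, 10_000))  # 0.06
```

Adding intervals that an older method missed, for example degraded copies, raises the estimate without any change in the underlying genome.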

Drosophila and Humans: Two Extremes in TE Diversity and Population Dynamics
Much of the detailed information on TE evolution still comes from two species with the best-studied genomes: fruit flies (D. melanogaster) and humans. Fortunately, these two genomes represent two extremes in terms of TE diversity and population dynamics and thus give a reasonably diverse picture of TE evolution and dynamics. For the rest of this chapter, we focus primarily on these two genomes and will highlight the similarities and differences observed between them.

Fig. 2 (continued) inserted in the exon of a gene can lead to exonization of the TE or to transcript truncation; (5) the whole domain of a TE protein could insert in the coding region of a gene generating a chimeric gene with host and TE domains [5,21]. In addition to these changes that depend on where the TE is inserted and on the sequences that the TE is adding, TEs can also alter the posttranslational modifications of histones. (b) TEs could also induce translation repression by generating secondary structure in the 3′ UTR of genes that leads to changes in the localization of the mRNA. This secondary structure could bind to one of the protein components of paraspeckle (p54nrb) and translocate to paraspeckle, a group of subnuclear bodies, avoiding moving out of the nucleus. However, the same secondary structure could bind to the dsRNA-binding protein Staufen 1 (STAU1) and in this case translocate to cytoplasm. Once in the cytoplasm, the secondary structure could bind to STAU1 again allowing translation, but under some situations mRNA could bind to the dsRNA-dependent protein kinase (PKR) repressing translation [23]. (c) Ectopic recombination between TE copies (green boxes with yellow arrows) in the same orientation can lead to deletions when recombination takes place between copies located on the same chromatid (1) or deletions and duplications when recombination takes place between copies in different chromosomes (2) (recombination between two nonhomologous chromosomes should lead to a translocation). Ectopic recombination between TE copies in opposite orientation leads to inversion of the DNA between the two TEs (3).
As mentioned above, the human reference genome has millions of TE copies, with 66-69% of the genome mostly derived from TE sequences [13]. Two human retrotransposable element (class I) families, LINE1 (L1, long interspersed nuclear element 1) and Alu, account for 60% of all interspersed repeat sequences. The vast majority of the TEs in the human genome are fixed, and most families are inactive. However, some elements of the main families of human endogenous retrovirus (HERV-K) and LINE1 elements show autonomous transposition. Meanwhile, elements of Alu and the hybrid SVA elements formed by SINEs (short interspersed nuclear elements), VNTRs (variable number tandem repeats), and Alus show nonautonomous activity [64][65][66].
In contrast, the fruit fly D. melanogaster reference genome contains only thousands of individual TE copies (5416 TE copies in FlyBase R6.04) accounting for only ~5.5% of the euchromatin [67]. If the missing percentage of TEs detected in chromosome 2L is similar in other chromosomes, the euchromatin TE content might be higher (~8.7%) [14]. If heterochromatin is also included, TEs account for 11-20% of the D. melanogaster genome [68,69]. D. melanogaster TEs belong to approximately 100 diverse families of both class I and class II elements [69,70]. Each family consists of 1-304 copies with no dominant family corresponding to the majority of TEs. The only exception is the INE-1 family, which contains ~2000 copies and has been inactive for the past ~3-4.6 million years [71][72][73]. The majority of TE families are considered to be active in Drosophila: individual TE copies are generally polymorphic in the population and show a high sequence similarity [69,70,74,75]. Indeed, there is experimental evidence showing that Gypsy and ZAM elements are active [76,77]. Besides, there is indirect evidence for the activity of 24 D. melanogaster superfamilies based on a whole-genome sequencing experiment of mutation accumulation lines [75] (Table 1).
Why do these two genomes differ so profoundly in content, diversity, and activity of TEs? The answer must lie in different aspects of TE population dynamics within genomes and forces that lead to varying rates of TE family birth and extinction. In the rest of this review, we focus on the state of knowledge of different aspects of TE population dynamics and discuss aspects of TE family evolution. Specifically, we focus on rates of TE transposition, fixation, or loss in human and D. melanogaster populations due to stochastic forces and natural selection for or against TE insertions and forces that affect coexistence of multiple TE families and the standing diversity of TE types (Fig. 3).

Methodology Used to Study TE Population Dynamics
TE dynamics continues to be studied using three main approaches: mathematical modeling, computer simulations, and the analysis of empirical data. Often a combination of these approaches is used to better understand TE abundance, diversity, and distribution (Table 2). Le Rouzic et al. [78] applied the statistical framework originally developed to infer speciation and extinction dynamics in species phylogenies to reconstruct the evolutionary history of TEs [78]. The model makes it possible to estimate and interpret the pattern of transposition activity that results in different TE copy number distributions [78]. The authors also performed computer simulations to provide reference dynamics that aid in the interpretation of the results obtained (Table 2). Traditionally, mathematical models considered the relationship between the host and a homogeneous group of active TEs. However, the TE content of any genome is a mixture of autonomous and nonautonomous insertions. Xue and Goldenfeld [79] proposed a mathematical model that considers the relationship between nonautonomous and autonomous TEs as a predator-prey dynamic. Unlike previous models that also use the analogy to ecological models, the Xue and Goldenfeld model takes into account the molecular-level interactions between transposable elements and the small copy number of the active transposons. The model predicts oscillations in the number of TEs on a time scale much longer than the cell replication time, suggesting that the genome stores the predator-prey state during successive generations [79]. TE dynamics have also been analyzed in variable environments [80,81] (Table 2). Gogolesky et al. [81] proposed a stochastic computational model to analyze the dynamics of active TEs in genomes of sexual diploid organisms under environmental stress. They based their model on the Fisher geometrical model of fitness landscapes. Overall, the authors conclude that the presence of inactive copies of TEs is necessary for the transposition-selection equilibrium of autonomous copies and that the mutator capacity of TEs might be important when host populations face rapid environmental changes [81].
Other recently developed methods analyzed the influence of the mating system in TE dynamics, different modes of selection, or applied branching models for studying the propagation of particular TE classes [82][83][84] (Table 2).
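To make the predator-prey analogy concrete, the sketch below integrates the classical deterministic Lotka-Volterra equations with a fourth-order Runge-Kutta step. This is a generic textbook system, not the stochastic model of Xue and Goldenfeld [79]. As an assumption made for illustration only, autonomous copies are treated as the prey and nonautonomous copies as the predator that exploits their machinery. All parameter values and function names are hypothetical.

```python
def lotka_volterra_step(prey, pred, dt, a=1.0, b=0.1, c=1.5, d=0.075):
    """One RK4 step of dx/dt = a*x - b*x*y and dy/dt = -c*y + d*x*y.

    x (prey) stands for autonomous copies and y (pred) for nonautonomous
    copies; this mapping and the parameter values are illustrative assumptions.
    """
    def f(x, y):
        return a * x - b * x * y, -c * y + d * x * y

    k1 = f(prey, pred)
    k2 = f(prey + 0.5 * dt * k1[0], pred + 0.5 * dt * k1[1])
    k3 = f(prey + 0.5 * dt * k2[0], pred + 0.5 * dt * k2[1])
    k4 = f(prey + dt * k3[0], pred + dt * k3[1])
    prey += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
    pred += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return prey, pred


def simulate(prey0=10.0, pred0=5.0, dt=0.01, steps=3000):
    """Return the trajectory as a list of (prey, pred) pairs."""
    trajectory = [(prey0, pred0)]
    for _ in range(steps):
        trajectory.append(lotka_volterra_step(*trajectory[-1], dt))
    return trajectory


trajectory = simulate()
prey_values = [x for x, _ in trajectory]
# Copy numbers cycle around the equilibrium c/d instead of settling on it.
print(min(prey_values), max(prey_values))
```

In this deterministic version the oscillations never damp. A stochastic treatment with small copy numbers, as in the model described above, can behave differently.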
Table 2 (continued)
TEs in sexual diploid populations. The model studies the evolutionary behavior of TE copy number and the molecular evolution of their DNA sequences. The model predicts that weak selection allows high copy numbers of TEs, most of them inactive copies, while strong selection reduces the number of TEs but increases the proportion of active copies. Regarding TE sequences, the model shows that the phylogeny of these sequences allows distinguishing active copies from non- and less active copies [83].
roo, Gypsy, and DM412, TEs of LTR family from Drosophila melanogaster. The model analyzes the propagation of LTR TEs by taking into account the TE position in the chromosome, the degradation level of the TEs, and the duplication rate that varies with the degradation level. The simulation estimates several parameters affecting the propagation of TEs and identifies the initial copy from which three LTR families have spread on the euchromatin part of the 3L chromosome [84].

In addition to mathematical modeling and simulations, multiple computational tools have been developed in the last 5 years to analyze TEs in sequenced genomes. While some of these tools aim to assess the global abundance and diversity of TEs in the genome, such as dnaPipeTE, or to annotate TEs in assembled genomes, such as REPET, most of them focus on discovering and/or genotyping individual copies of TEs in the genome using next-generation sequencing (NGS) data [11][64][85][86][87][88][89][90]. The diversity of methods available makes it difficult to choose the most appropriate one for the analyses of a given genome. To try to overcome this limitation, Nelson et al. [91] developed an integrated pipeline named McClintock that incorporates six complementary TE detection methods. McClintock generates standardized output for the different TE detection methods, thus facilitating the comparison of the results obtained with the different pipelines, as well as facilitating their installation and use [91]. This and other studies that compared the performance of several tools arrived at the same conclusion: several computational tools should be combined to increase the accuracy of TE analysis [64,86,91]. The availability of third-generation sequencing techniques (3GS) should help improve the detection and genotyping of TE insertions. Although 3GS was developed before 2010 [92], it is only in the last few years that this technique has started to be used [14,93]. Chakraborty et al. [14] reported the assembly of a D. melanogaster genome from a Zimbabwe strain using long-read single molecule real-time sequencing with 147X coverage. Among several novel structural variants described, they identified 37% additional TE insertions in the 2L chromosome compared with a previous study that used 70X coverage of short reads [14,94]. 3GS technologies have also been applied to the sequencing of human genomes, although a detailed analysis of TE content based on long-read data has not been performed yet [95][96][97].
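The recommendation to combine several detection tools can be illustrated with a simple consensus rule: an insertion is kept only if at least a minimum number of tools report the same TE family at nearby positions. The sketch below is a hypothetical illustration. The input structure, the tool names, the 50 bp window, and the function names are assumptions; they are not the output format or the logic of McClintock or of any other published pipeline.

```python
from collections import defaultdict


def _flush(cluster, chrom, family, min_tools, out):
    """Keep a cluster of nearby calls if enough distinct tools support it."""
    tools = {tool for _, tool in cluster}
    if len(tools) >= min_tools:
        positions = sorted(pos for pos, _ in cluster)
        median_pos = positions[len(positions) // 2]
        out.append((chrom, median_pos, family, len(tools)))


def consensus_calls(calls_by_tool, min_tools=2, window=50):
    """calls_by_tool maps a tool name to a list of (chrom, pos, family) calls.

    Calls for the same chromosome and family are chained into one cluster
    while consecutive positions are at most `window` bp apart.
    """
    by_key = defaultdict(list)
    for tool, calls in calls_by_tool.items():
        for chrom, pos, family in calls:
            by_key[(chrom, family)].append((pos, tool))

    consensus = []
    for (chrom, family), hits in by_key.items():
        hits.sort()
        cluster = [hits[0]]
        for pos, tool in hits[1:]:
            if pos - cluster[-1][0] <= window:
                cluster.append((pos, tool))
            else:
                _flush(cluster, chrom, family, min_tools, consensus)
                cluster = [(pos, tool)]
        _flush(cluster, chrom, family, min_tools, consensus)
    return sorted(consensus)


# Toy calls from three hypothetical tools.
toy_calls = {
    "toolA": [("2L", 1000, "roo"), ("2L", 5000, "pogo")],
    "toolB": [("2L", 1010, "roo")],
    "toolC": [("2L", 1025, "roo"), ("3R", 200, "roo")],
}
print(consensus_calls(toy_calls))  # [('2L', 1010, 'roo', 3)]
```

Raising `min_tools` trades sensitivity for specificity. Because clusters are built by chaining neighbors, a dense run of calls can span more than one window, so a real pipeline would likely need a stricter definition of breakpoint agreement.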
Recently, Disdero and Filée [98] introduced the first tool that uses long-read sequences to identify TE insertions in the D. melanogaster genome: LoRTE [98]. The authors argue that available software based on short reads fails to correctly identify TEs that are present in highly repetitive regions of the genome, while long-read technologies should allow us to identify all TEs in a given genome. LoRTE, developed in Python, verifies presence and/or absence of previously annotated TEs and can also detect new insertions not previously annotated in the reference genome. LoRTE is able to work with low-coverage sequences (<10X), providing an efficient and accurate TE annotation in a cost-effective manner [98].

Empirical Estimates of the Rates of Transposition in Drosophila and Humans
Transposition rates in D. melanogaster have been traditionally estimated empirically by in situ hybridization and by using PCR approaches. The activation of TEs following intra- and interspecific hybridization has been studied in different Drosophila species [99][100][101]. For example, Vela et al. [100] estimated transposition rates in D. buzzatii-D. koepferae interspecific hybrid flies by in situ hybridization [100]. They found that hybrids showed at least one order of magnitude higher transposition rates than parental lines for at least three TE families [100]. Robillard et al. [102] estimated transposition rates by qPCR in an experimental evolution study in which a TE insertion was introduced in a strain lacking insertions from that particular family [102]. In the first generations after the introduction of the TE insertion, the transposition rate was 0.33-0.45 per copy per generation, while in the following generations, transposition rates were reduced by at least one order of magnitude. These values correspond to the first steps in the invasion of a genome by a TE, when transposition is faster than the rate measured in natural populations [102].
In the first edition of this chapter [103], we anticipated that NGS would allow studying transposition rates in a deeper and more accurate way. Indeed, recent studies have taken advantage of NGS data to estimate transposition rates in D. melanogaster. Rahman et al. [89] used NGS data to estimate the transposition rate in the reference strain by comparing two available genomes that were sequenced ~15 years apart. The average transposition rate for TEs belonging to different families was 7 × 10⁻⁵, which is on the same order of magnitude as the previously reported rates (~10⁻⁴ to 10⁻⁵). Furthermore, they confirmed the prediction of increased transposition rate in inbred lines: they estimated a higher average number of TE insertions in lab strains inbred for more generations compared with strains inbred for a smaller number of generations [89]. Adrion et al. [75] estimated spontaneous insertion and deletion rates in D. melanogaster mutation accumulation lines [75]. The authors identified 24 active superfamilies and estimated genome-wide insertion rates to be higher than deletion rates: 2.11 × 10⁻⁹ vs. 1.37 × 10⁻¹⁰ per site per generation, respectively. Superfamily-specific rates of insertion varied from 0 to 5.13 × 10⁻³ insertions per copy per generation and were within the range of previously estimated rates [75] (Table 1).
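The rates quoted above use two different denominators, per copy and per site, which are easy to confuse. The short sketch below shows how each would be computed from counts in a mutation accumulation design. The formulas are the usual count-over-opportunity estimators, and every number in the example is hypothetical; none is taken from the studies cited here.

```python
def per_copy_rate(new_insertions, ancestral_copies, generations, lines=1):
    """New insertions per existing copy per generation, pooled across lines."""
    return new_insertions / (ancestral_copies * generations * lines)


def per_site_rate(new_insertions, genome_sites, generations, lines=1):
    """New insertions per genomic site per generation, pooled across lines."""
    return new_insertions / (genome_sites * generations * lines)


# Hypothetical experiment: 8 lines propagated for 150 generations.
# A family with 100 ancestral copies gains 12 new insertions in total.
print(per_copy_rate(12, ancestral_copies=100, generations=150, lines=8))  # 0.0001

# The same style of estimate, scaled by an assumed number of sites instead of copies.
print(per_site_rate(30, genome_sites=120_000_000, generations=150, lines=8))
```

Per-copy rates are comparable across families with different copy numbers, whereas per-site rates are comparable with point mutation rates.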
In humans, previous studies estimated the transposition rate as 1 in 95 to 1 in 250 births for L1, 1 in 20 births for Alu insertions, and 1 in 916 births for SVA retrotransposons [104][105][106][107]. Although there are several recent studies that estimate transposition rates in humans using NGS data, they all focused on somatic transposition in the brain or in tumor samples [47,48,90].

Transposition Control Mechanisms
Understanding the mechanisms controlling the transposition of TEs is central to our understanding of TE dynamics. Many different mechanisms of TE regulation have been described [43,108,109]. In this section, we will highlight recent advances in both TE self-regulation and regulation by host factors.

TE Self-Regulation
Self-regulation of transposition was first described in prokaryotes and soon after in TEs involved in hybrid dysgenesis in Drosophila [110]. Recent studies have cast some doubt on one of the self-regulation mechanisms described: transposase overproduction inhibition. The transposase overproduction inhibition mechanism regulates the transposition of the IS630-Tc1-mariner, piggyBac, and hobo-Ac-Tam (hAT) superfamilies [111,112]. However, several studies reported contradictory results suggesting that transposase inhibition by overproduction does not always happen [113]. Bire et al. [113] suggested that some works failed to detect transposase inhibition because cellular cofactors are necessary to execute this regulation system, and as such it can only be detected in in vivo experiments [113]. However, Woodard et al. [114] showed that aggregation of transposase proteins produces filamentous structures (rodlets) in the nucleus in a host-independent manner [114]. The authors further showed that a decline in transposition occurs after transposase concentrations are high enough for filamentous structures to be visible [114]. Thus, it is still not clear why some in vitro experiments failed to detect transposase overproduction inhibition [114].

Regulation by Host Factors
Small RNAs, such as small-interfering RNAs (siRNAs) and piwi-interacting RNAs (piRNAs), are well-known to play an essential role in silencing TEs and preventing transposition. Several recent reviews highlight the monumental progress in this field [115][116][117][118][119]. In addition to posttranscriptional regulation of TEs, small RNAs are involved in transcriptional regulation as well. In mouse, piRNAs are required for de novo methylation and silencing of TEs [120]. In Drosophila, Piwi proteins repress transcription and correlate with an increase in repressive chromatin marks at loci targeted by piRNAs [121].
While the role of siRNAs and piRNAs has been established for several years, a role of microRNAs (miRNAs) in suppressing the mobility of retrotransposons was only recently described [122]. The authors showed that miR-128 binds to L1 RNA and represses its integration in humans [122].
New studies have also provided evidence for the role in TE repression of proteins previously known for their roles in other cellular processes such as interferon-stimulated proteins, the tumor suppressor p53, and the longevity regulating protein SIRT6. Several interferon-stimulated genes, such as the Moloney leukemia virus 10 (MOV10), the zinc-finger antiviral protein (ZAP), and the 3′ repair exonuclease 1 (TREX1), which are associated with virus response, have recently been implicated in the inhibition of L1 activity [66,123]. Recently, it has also been shown that the p53 transcription factor, which is involved in stress response networks and acts to restrict oncogenesis, also restricts retrotransposon activity in zebrafish, flies, and humans [124]. The authors showed that p53 interacts with components of the piwi-interacting RNA (piRNA) pathway to suppress retrotransposition [124]. Finally, the longevity regulating protein SIRT6 is also involved in retrotransposon repression by coordinating their packaging into transcriptionally repressive heterochromatin. SIRT6 binds to the 5′ UTR region of retrotransposons and mono-ADP ribosylates the Krüppel-associated protein 1 (KAP1), facilitating the interaction of KAP1 with the heterochromatin protein 1α (HP1α) and leading to chromatin compaction [125].

Natural Selection Against TE Insertions
Natural selection and stochastic processes influence both the rate of fixation and the frequency distribution of TEs in populations. The efficiency of selection depends on the effective population size, which largely differs between Drosophila and humans: >10⁸ and 10⁴, respectively [126,127]. Thus, while in Drosophila the high efficiency of selection should lead to the removal of slightly deleterious TE insertions, in humans, these insertions may accumulate in the genome. Indeed, most of the TE sequences in the human genome are remnants of ancient insertions [12].
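The dependence of selection efficiency on effective population size can be made quantitative with Kimura's diffusion approximation for the fixation probability of a new additive mutation. This is a standard population genetics result that is not part of the studies reviewed here. The sketch below compares a slightly deleterious insertion in populations of the two sizes quoted above. The selection coefficient is a hypothetical value, and the census size is assumed to equal the effective size.

```python
import math


def fixation_probability(s, n_e):
    """Kimura's approximation for a new additive mutation at frequency 1/(2*n_e).

    P = (1 - exp(-2s)) / (1 - exp(-4 * n_e * s)), assuming census size equals
    effective size. For s < 0 the ratio is evaluated in log space so that
    very large values of n_e * |s| underflow to 0.0 instead of overflowing.
    """
    if s == 0:
        return 1.0 / (2.0 * n_e)  # neutral expectation
    a = 2.0 * s
    b = 4.0 * n_e * s
    if s > 0:
        return math.expm1(-a) / math.expm1(-b)
    # s < 0: P = (exp(|a|) - 1) / (exp(|b|) - 1)
    log_num = math.log(math.expm1(-a))
    log_den = -b + math.log(-math.expm1(b))
    return math.exp(log_num - log_den)


s = -1e-5  # hypothetical selection coefficient of a slightly deleterious insertion
for label, n_e in (("human-like, Ne = 1e4", 1e4), ("Drosophila-like, Ne = 1e8", 1e8)):
    relative = fixation_probability(s, n_e) * 2.0 * n_e
    print(label, "fixation probability relative to neutral:", relative)
```

With these values the insertion behaves as nearly neutral in the smaller population, while its fixation probability in the larger population is numerically indistinguishable from zero. This is the contrast drawn in the paragraph above.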
A review by Barrón et al. [128] explored the latest insights on the nature of selection acting against the deleterious effects of TEs in D. melanogaster populations [128]. More recently, Kofler et al. [129] analyzed intraspecific TE dynamics between D. melanogaster and D. simulans populations to shed light on the long-term evolution of TEs [129]. They confirmed that most of the TEs are present at low frequencies in D. melanogaster and showed that the same pattern is present in D. simulans. Based on computer simulations showing that 50% of the TE families have temporally heterogeneous transposition rates, and on the differences in TE composition between populations of the same species, the authors suggested that TE activity has recently increased in the two species. They proposed that the demographic history of both species, with a recent colonization of different environments, could be the cause of the high TE activity detected [129].
In humans, a recent study took advantage of the 1000 Genomes Project data, which report 16,192 polymorphic TEs, to perform the most complete TE dynamics analysis to date [130]. Most of the polymorphic TEs were found to be present at very low frequencies: >93% of TEs showed <5% allele frequency in 26 human populations. These results confirm that, overall, polymorphic TE insertions are deleterious in humans, as was previously suggested with smaller family-specific datasets [131].

TE-Induced Adaptations
Several recent reviews have compiled results that showcase the adaptive role of TEs [19,24,50,59,128]. We would like to highlight the recent discovery of a TE in a fish-like marine chordate that encodes RAG-like proteins with endonuclease-transposase activity [39]. This discovery provides evidence that supports the TE origin hypothesis for the adaptive immune system in jawed vertebrates [39]. Two other recent publications provide experimental evidence for a role of TEs as providers of functional transcription factor binding sites (TFBS) involved in immune response and in cell pluripotency [50,132]. A recent study linked ERV elements in humans with the interferon response pathway [50]. The authors showed that ERVs carrying enhancers have been co-opted to activate different genes involved in the inflammatory response activated by interferon. This example shows how the exaptation of one family of TEs could shape a transcriptional network to activate different genes with one trigger system [50]. Sundaram et al. [132] reported mouse-specific TEs that contain multiple transcription factor binding sites for pluripotency transcription factors. The majority of the TEs were experimentally shown to exhibit enhancer activity in mouse embryonic stem cells, including an in silico reconstructed ancestral TE. This latter result suggests that ancestral TEs already had transcriptional regulatory sites [132].

Rate of Loss
A recent study estimated genome-wide and superfamily-specific TE deletion rates in D. melanogaster inbred lines [75]. The authors found that most of the deletions involved retrotransposon elements suggesting that the deletions were due to ectopic recombination instead of excision. Deletion rates were smaller than insertion rates estimated in the same inbred lines [75].
In vertebrates, lineage-specific differences in TE deletion rates have been reported [133]. A possible explanation for this observation is that the success of some families results in a competition for the genome resources leading to the elimination of other TE families [133].
In addition to TE deletion rates, DNA loss rates should also be considered. In the human lineage, estimates of DNA loss are smaller than estimates of DNA gain, 650 Mb vs. 815 Mb [134], while in D. melanogaster, the rate of DNA loss is higher than the rate of DNA gain [135][136][137].

Horizontal Transfer of TE Insertions
In addition to parent to offspring transmission, TEs can also be horizontally transferred [138][139][140][141]. By combining simulation and analytical approaches, Groth and Blumenstiel [142] suggested that exposure rate to new TE families through horizontal transfer can be an important determinant of TE genomic content when the effects of drift in a population are weak [142]. Thus, larger populations are expected to carry a higher TE content if population exposure rate is proportional to population size [142]. So far, most of the evidence for TE horizontal transfer comes from closely related and geographically close species [140]. There are several examples of horizontal transfer of TEs in Drosophila species, while so far horizontal transfer of TEs has not been described in humans [138].

Conclusion
Recent years have seen an increase in the number of reference genome sequences available as well as of population genome datasets. The availability of all these genome sequences and the development of new bioinformatics tools have allowed us to update our previous estimates of genomic TE content, which have increased both in humans and in D. melanogaster. These data have also allowed us to gather more evidence for the functional impact, both detrimental and beneficial, of TE insertions. Thus, it is still indisputable that understanding TE population dynamics is essential to understand genome structure, genome function, and genome evolution.
New methods developed to analyze the dynamics of TEs in populations have shed light on the interplay between autonomous and nonautonomous TE copies, TE invasion dynamics, and how the mating system influences the dynamics of TEs in genomes. We have also considerably advanced our knowledge of the host factors that regulate TE activity as well as of the genome features that influence TE dynamics (Fig. 3). Finally, differences in effective population sizes that affect the efficiency of selection against new TE insertions and differences in the rates of TE loss between humans and D. melanogaster can still be considered two important factors that contribute to the different abundance, diversity, and activity of TEs in these two species [103].

Questions
How can differences in the rate of DNA loss affect the evolutionary dynamics of TEs?
Why is host regulation of transposition relevant for TE dynamics?
Which is the most important factor explaining the differences in TE content, diversity, and activity between humans and Drosophila?
Have the next-generation sequencing (NGS) technologies allowed us to identify all the TEs in a given genome?
How does the interaction between active and inactive copies of TEs affect TE dynamics?