Transposable elements play an important role during cotton genome evolution and fiber cell development

Transposable elements (TEs) usually occupy the largest fraction of a plant genome and are also the most variable part of its structure. Although they were traditionally labeled “junk” or “selfish” DNA, growing evidence points to their participation in gene regulation, including gene mutation, duplication, movement and novel gene creation, via genetic and epigenetic mechanisms. The recently sequenced genomes of the diploid cottons Gossypium arboreum (AA) and Gossypium raimondii (DD), together with their allotetraploid progeny Gossypium hirsutum (AtAtDtDt), provide a unique opportunity to compare genome variation in the Gossypium genus and to analyze the functions of TEs during its evolution. TEs account for 57%, 68.5% and 67.2% of the DD, AA and AtAtDtDt genomes, respectively. The 1,694 Mb A-genome was found to harbor more long terminal repeat (LTR)-type retrotransposons, which made the principal contribution to the twofold increase in its genome size relative to the 775.2 Mb D-genome. Although the 2,173 Mb AtAtDtDt genome showed a TE content similar to that of the A-genome, the total numbers of LTR-gypsy and LTR-copia type TEs varied significantly between the two genomes. Considering the roles of TEs in rewiring gene regulatory networks, we believe that TEs may be involved in cotton fiber cell development. Indeed, the insertion or deletion of different TEs in the upstream regions of two important transcription factor genes in the At or Dt subgenome resulted in qualitative differences in target gene expression. We suggest that these findings may open a window for improving cotton agronomic traits by editing TE activities.


INTRODUCTION
In the late 1940s, geneticist Barbara McClintock first described transposable elements (TEs) in maize (McClintock, 1948). She named them "controlling elements" because she observed that they could alter the transcription of nearby genes (McClintock, 1956). In the following decades, TEs received little attention and were even described as "junk" or "parasitic" DNA by most other scientists. More recently, however, mounting evidence has suggested that TEs contribute significantly to gene regulatory networks (Feschotte, 2008). McClintock's discovery that TEs can control and modulate gene expression has now been supported by experiments across an ever wider range of organisms.
TEs are categorized as class I (retrotransposons) and class II (DNA transposons) (Wicker et al., 2007). Retrotransposons mainly include LTR (long terminal repeat) elements, DIRS-like elements (Dictyostelium intermediate repeat sequence), PLEs (Penelope-like elements), LINEs (long interspersed nuclear elements) and SINEs (short interspersed nuclear elements), and comprise both autonomous and nonautonomous elements. They transpose in a "copy and paste" manner that uses an RNA intermediate to make a DNA copy. The LTR-type retrotransposons are abundant in plants and are divided into two major superfamilies, gypsy and copia, which differ in the order of the open reading frames (ORFs) that encode transposition-related proteins. DNA transposons transpose via a DNA intermediate and can likewise be classified into two subclasses (Wicker et al., 2007). One subclass moves double-stranded DNA within the genome by a "cut and paste" mechanism, while the other mobilizes single-stranded DNA by a "rolling circle" mechanism.

TE EFFECTS ON GENOME EVOLUTION
TE transposition acts in concert with hybridization, polyploidy or whole genome duplication (WGD), recombination, and horizontal gene transfer to promote genome evolution (Oliver et al., 2013). Transposition affects genome composition in two ways. (i) Increasing genome size: TEs often make up the major part of angiosperm genomes. TE-rich genomes such as that of maize (with up to 84% TE content (Schnable et al., 2009)) have been described as "gene islands surrounded by TE sea" (Oliver et al., 2013; Bennetzen et al., 2014; Tenaillon et al., 2010), and TE amplification plays an important role in genome expansion (Bennetzen et al., 2014; Piegu et al., 2006; Hawkins et al., 2006). For example, three LTR-type retrotransposon families in a wild rice became active within the last three million years and are mainly responsible for doubling its genome size (Piegu et al., 2006). (ii) Removing and rearranging genome fragments: unequal homologous recombination (HR) and illegitimate recombination (IR) are the two major mechanisms of DNA removal, and TEs play a part in both. HR between TEs of the same family is considered one route for DNA translocation. Transposition of class II TEs by the "cut and paste" mechanism may cause chromosome breaks that are then exploited by IR during DNA removal.

TE EFFECTS ON GENE EXPRESSION
In addition to these macroscopic effects on overall genome structure, TEs cause widespread changes in gene expression levels. In recent years, plant biologists have reported numerous examples in which TE-mediated genotypic changes lead to phenotypic variation at almost every stage of the plant life cycle, from seed formation to flowering and fruiting. Figure 1 shows several examples of TE-associated variation in higher plants. In maize, insertion of an LTR-type retrotransposon in a b1 exon resulted in ectopic expression of the gene and anthocyanin accumulation in kernels, while a different TE insertion resulted in reduced and variegated expression of the gene, with pigment accumulating in only a few kernels (Figure 1A) (Selinger et al., 2001). In grape, insertion of Hatvine1-rrm, a DNA TE, in the VvTFL1A promoter caused up-regulation of the corresponding allele in reproductive and vegetative organs of the shoot apex (Figure 1B) (Fernandez et al., 2010). In oranges, insertion of an LTR-type retrotransposon upstream of the Ruby gene resulted in ectopic and temperature-dependent expression of the gene in the flesh, producing so-called "blood oranges" (Figure 1C) (Butelli et al., 2012). In contrast, insertion of Gret1, an LTR-type retrotransposon, reduced VvmybA1 expression, leading to loss of purple color in the grape mutant (Figure 1D) (Kobayashi et al., 2004). In tomato, an unusual 24.7 kb gene duplication event mediated by the LTR-type retrotransposon "Rider" increased IQD12 (SUN locus) expression relative to that of the ancestral copy, culminating in an elongated fruit shape (Figure 1E) (Xiao et al., 2008). "Hopscotch", a transposable element inserted in a regulatory region of the maize domestication gene teosinte branched1 (tb1), acts as an enhancer of gene expression and increases apical dominance in maize compared with its progenitor, teosinte.
Molecular evidence indicates that the "Hopscotch" insertion predates the domestication of maize by at least 10,000 years (Figure 1F) (Studer et al., 2008). A CACTA-like TE represses the expression of ZmCCT, another maize domestication gene, to reduce photoperiod sensitivity, thereby accelerating the spread of maize to long-day environments. In rice, an LTR-type retrotransposon insertion refunctionalized pit, a "sleeping" disease resistance (R) gene in the rice genome, to enhance fungal resistance (Figure 1G) (Hayashi et al., 2009). TE-mediated changes in plants (and indeed in any eukaryote), which range from subtle quantitative effects on target gene expression to the rewiring of regulatory networks and the evolution of new genes, are categorized as either genetic or epigenetic. Genetic changes caused by TE insertion involve target gene mutation, structural modification, movement, evolution of novel genes and modulation of expression (Lisch, 2013).
Figure 2 summarizes some examples of TE-mediated epigenetic effects. In Figure 2A, the chromatin remodeling factor deficient in DNA methylation 1 (DDM1) is essential for DNA methylation, and its mutation has been reported to cause a profound loss of DNA methylation in Arabidopsis. For example, DNA methylation of a SINE element silences the downstream FLOWERING WAGENINGEN (FWA) locus in vegetative tissues. In the ddm1 mutant, loss of DNA methylation in the SINE causes ectopic expression of FWA and results in a late-flowering phenotype (Kinoshita et al., 2007). A second example arose after several generations in the ddm1 mutant background: loss of DNA methylation of a LINE activates TE transcription and produces a transcript antisense to the downstream gene BONSAI, which finally results in the spread of DNA methylation, transcriptional silencing of this region, and severe dwarfing. In Figure 2B, a miniature inverted repeat transposable element (MITE) produces siRNAs that block expression of the flanking genes through H3K9 dimethylation (H3K9me2), thereby controlling agricultural traits in rice. In Figure 2C, a DNA TE (mutator-like element, MULE) is inserted in the first intron of the FLOWERING LOCUS C (FLC) gene in the Ler allele, and epigenetic silencing of the TE represses FLC expression through the siRNA-mediated silencing machinery. The transcript of the Athila LTR-type retrotransposon was shown to be processed into siRNAs in Arabidopsis pollen through the activity of RNA-dependent RNA polymerase 6 (RDR6) and Dicer-like 2 and 4 (McCue et al., 2012). One of these TE-siRNAs is incorporated into argonaute (AGO) to target UBP1b and inhibit its translation (Figure 2D).
Figure 1 (panels B to G). B, Insertion of a DNA TE in the VvTFL1A promoter significantly increases branching in the cluster structure of grape (wild type on the left, mutant on the right). C, The orange "Navelina" shows limited expression of Ruby, and its flesh is yellow (upper left). In "Tarocco", an LTR-type retrotransposon drives expression in the flesh, with a resultant red color (upper right). Recombination between the two LTRs of the retrotransposon resulted in enhanced expression of Ruby and produced the cultivar "Maro I" with purple flesh (lower left). "Jingxian", a Chinese variety with a similar retrotransposon in the same allele, also exhibits red flesh (lower right). D, Insertion of an LTR-type retrotransposon reduced VvmybA1 expression, which changes the grape skin from purple (top panel) to white (lower panel). E, Ectopic expression of IQD12 (SUN locus) caused by a retrotransposon produces long tomato fruits (wild type on the left, mutant on the right). F, Insertion of a copia-like retrotransposon in the upstream region of tb1 significantly reduced branch formation during domestication from teosinte (left) to maize (right). G, Transcriptional activation of the rice disease resistance gene pit by an LTR-type retrotransposon enhances fungal resistance in resistant plants (left) compared with the variant (right).
In summary, TEs may act as cis- or trans-regulating factors to modulate the expression of flanking or distant genes, respectively. TEs may cause transcriptional gene silencing (TGS) through DNA (McCue et al., 2012) or histone methylation, or impose post-transcriptional gene silencing (PTGS) through mRNA cleavage or translational repression. Recent experimental data suggest that small RNAs (miRNA or siRNA) may be the common actor in these different mechanisms (Yelina et al., 2015; He et al., 2015).

TE EFFECTS ON COTTON GENOME EVOLUTION
Cotton is the largest source of natural textile fiber and a significant oilseed crop. The cotton (Gossypium genus) with [...], 2007). The A genome, which produced the first spinnable fibers, evolved soon after its divergence from the F genome less than 5 million years ago (MYA). The allotetraploid (AtAtDtDt) species G. hirsutum and G. barbadense were formed following an interspecific hybridization between an A-genome diploid and a D-genome diploid 1-2 MYA (Figure 3). As the genus diversified and spread into different habitats, it underwent dramatic genome variation that created adaptive plasticity for natural selection to act on. At present, eight diploid genome groups (A, B, C, D, E, F, G and K) are recognized based on cytogenetic and genetic observations of chromosome pairing behavior, chromosome size, and relative fertility in interspecific hybrids (Hawkins et al., 2006). This genome diversity is matched by wide-ranging phenotypic variation in the Gossypium genus, such as herbaceous to woody stems, corolla colors that almost span the rainbow, leaf shape and seed size (Fryxell et al., 1979). Fibers, or so-called seed coverings, are extraordinarily diverse in Gossypium, ranging from nearly glabrous seeds to short brown hairs and finally to the long, fine white fibers of cultivated allotetraploid species. Spinnable fiber evolved only in the ancestor of modern A-genome cottons, which later passed this ability on to the tetraploid species. Two A-genome diploid species, G. arboreum and G. herbaceum, and two allotetraploid species, G. hirsutum and G. barbadense, were independently domesticated for cotton fiber traits at least 4,000 years ago (Figure 4) (Dillehay et al., 2007).

TES FOUND IN DIFFERENT COTTON GENOMES
Polyploidization and genome size variation may hold the key for future improvement in cotton fiber yield and quality. Since 2012, several joint research groups including our lab have reported the draft genomes of two ancestral diploid cottons, Gossypium raimondii (D) and Gossypium arboreum (A) (Li et al., 2014), together with the cultivated allotetraploid Gossypium hirsutum (AD) (Cao, 2015). At about the same time, two other joint groups assembled the same Gossypium raimondii (D) (Paterson et al., 2012) and Gossypium hirsutum (AD) genomes. Table 1 shows the annotations for the three genomes based on our joint efforts. The D genome contains 40,976 protein-coding genes, and TEs make up 441.4 Mb, or 57% of the genome. This genome has about 398,000 gypsy-type and about 185,000 copia-type retrotransposons, which together account for 78.6% of the TEs. The A genome contains 41,330 protein-coding genes, and TEs make up 1,160 Mb, or 68.5% of the genome. It has 1,086,000 gypsy-type and 186,000 copia-type retrotransposons. The AD genome is 2,173 Mb in length, of which 1,471 Mb, or 67.2%, consists of TEs. It contains about 1,128,000 gypsy-type and 276,000 copia-type retrotransposons. Retrotransposons make up over 90% of the TEs in both the A and the AD genomes, which differs significantly from the D genome.
Figure caption (fragment): ... Li et al., 2014). The green boxes indicate the domestication history of cotton, which started about 4,000 years ago.
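The difference in LTR superfamily composition among the three genomes is easier to see as a ratio. The following is a minimal sketch that only restates the approximate counts quoted above; the dictionary layout and the function name are our own illustrative choices, not part of any published pipeline.

```python
# Approximate retrotransposon counts as quoted in the text (see Table 1).
COUNTS = {
    "D (G. raimondii)": {"gypsy": 398_000, "copia": 185_000},
    "A (G. arboreum)": {"gypsy": 1_086_000, "copia": 186_000},
    "AD (G. hirsutum)": {"gypsy": 1_128_000, "copia": 276_000},
}


def gypsy_to_copia_ratio(counts):
    """Return the number of gypsy-type elements per copia-type element."""
    return counts["gypsy"] / counts["copia"]


# One ratio per genome, keyed by the same labels as COUNTS.
ratios = {genome: gypsy_to_copia_ratio(c) for genome, c in COUNTS.items()}

for genome, ratio in ratios.items():
    print(f"{genome}: gypsy/copia = {ratio:.2f}")
```

With these counts the ratio is about 2.2 in the D genome, 5.8 in the A genome and 4.1 in the AD genome, so the A genome stands out through gypsy-type rather than copia-type elements.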

TE EXPANSION IN COTTON GENOMES
In plants, apart from polyploidization, TE amplification appears to be the major contributor to genome size inflation. For example, the maize genome doubled in size in just a few million years, most likely owing to increased TE activity (Sanmiguel et al., 1998). In Gossypium, genome size varies more than threefold, from ~800 Mb (1C) in the D-genome to ~2,500 Mb (1C) in the K-genome, even though all diploid species share 13 chromosomes (Figure 3). Previous studies suggested that the variation in genome size among diploid species reflects differences in the copy number of repetitive DNA sequences (Zhao et al., 1998), especially retrotransposons (Hawkins et al., 2006).
By analyzing the distribution of TE divergence rates, we confirmed that retrotransposons expanded in the G. raimondii genome during the last 1-3 million years. In G. arboreum, two major clusters of active LTR-type retrotransposons were found at 0-0.5 and 3.5-4.5 MYA, with two minor sets occurring around 1.0 and 7.0-8.0 MYA (Figure 4), which indicates that bursts of LTR-type retrotransposon activity may be responsible for the twofold larger genome of G. arboreum compared with G. raimondii, similar to what was previously reported for maize (Swigoňová et al., 2004). Further analysis of syntenic blocks on chromosome 7 of G. arboreum and G. raimondii (3.5 and 1.5 Mb in length, respectively) showed that the protein-coding genes in the two blocks were highly collinear, whereas the G. arboreum block contains 4,098 TEs and the G. raimondii block only 1,542. A total of 2,377 Gorge elements (one group of gypsy-type retrotransposons) were found in this region in G. arboreum, whereas only 324 were found in the same region in G. raimondii (Li et al., 2014).
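Because the two syntenic blocks differ in length, the raw TE counts are easier to compare after normalization. The sketch below is illustrative arithmetic on the numbers quoted above only; the names `BLOCKS` and `summarize` are hypothetical and no additional data are assumed.

```python
# Syntenic blocks on chromosome 7, with lengths and counts as quoted in the text.
BLOCKS = {
    "G. arboreum": {"length_mb": 3.5, "te_count": 4098, "gorge_count": 2377},
    "G. raimondii": {"length_mb": 1.5, "te_count": 1542, "gorge_count": 324},
}


def summarize(block):
    """Return TE density, Gorge density (per Mb) and the Gorge share of all TEs."""
    return {
        "te_per_mb": block["te_count"] / block["length_mb"],
        "gorge_per_mb": block["gorge_count"] / block["length_mb"],
        "gorge_share": block["gorge_count"] / block["te_count"],
    }


summary = {species: summarize(block) for species, block in BLOCKS.items()}

for species, stats in summary.items():
    print(
        f"{species}: {stats['te_per_mb']:.0f} TEs/Mb, "
        f"{stats['gorge_per_mb']:.0f} Gorge/Mb, "
        f"Gorge share {stats['gorge_share']:.1%}"
    )
```

On these figures, Gorge elements make up roughly 58% of the TEs in the G. arboreum block but only about 21% in the G. raimondii block, which is consistent with this one gypsy-type group accounting for much of the size difference between the two collinear regions.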
In allotetraploid Gossypium hirsutum, based on calculations of the spontaneous mutation rate and on transcriptome data, we suggested that LTR-type retrotransposons were active in the allotetraploid genome and that copia elements were markedly more active than gypsy elements in the most recent 0-1 million years. The analysis also showed that a higher proportion of copia than gypsy elements is located near coding genes, and that TEs of the Dt subgenome tend to have been more active than those of the At subgenome after tetraploidization. We therefore believe that rapid structural and epigenetic reorganization of the genome occurred during the early stage of polyploid formation.
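The insertion times discussed in this section come from sequence divergence. The text does not spell out the calculation, so the following is only a sketch of one commonly used approach, not necessarily the procedure behind the figures above. The two LTRs of an element are identical at insertion, their divergence K is estimated with a Jukes-Cantor correction, and the age is T = K / (2r). The substitution rate `ASSUMED_RATE` and the toy sequences are placeholders chosen for illustration, not values taken from the cotton analyses.

```python
import math

# Placeholder substitution rate (per site per year); an assumption for illustration only.
ASSUMED_RATE = 1.3e-8


def p_distance(ltr_5prime, ltr_3prime):
    """Proportion of mismatched sites between two aligned LTRs, skipping gap columns."""
    if len(ltr_5prime) != len(ltr_3prime):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for a, b in zip(ltr_5prime.upper(), ltr_3prime.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance K for an observed mismatch proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time_years(k, rate=ASSUMED_RATE):
    """Estimated insertion time T = K / (2r); both LTRs accumulate changes independently."""
    return k / (2 * rate)


# Toy example: two aligned 20-bp LTR fragments with a single mismatch.
example_p = p_distance("ACGTACGTACGTACGTACGT", "ACGTACGTATGTACGTACGT")
example_k = jukes_cantor_distance(example_p)
example_age = insertion_time_years(example_k)
print(f"p = {example_p:.3f}, K = {example_k:.4f}, T = {example_age / 1e6:.2f} million years")
```

Binning such estimates over all intact LTR retrotransposons of a genome gives an age distribution in which amplification bursts appear as peaks. The absolute ages scale inversely with the assumed rate, so the choice of r should always be reported.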

TES MAY BE INVOLVED IN COTTON FIBER DEVELOPMENT
The cotton fiber is a unicellular, unbranched, simple trichome cell; fibers are derived from ~25% of the epidermal cells (>25,000 cells) of a developing cottonseed (Li et al., 2005). Fiber properties differ greatly among the three cotton species. For example, the allotetraploid G. hirsutum usually produces fibers >3 cm in length, whereas G. arboreum produces fibers 1.3-1.5 cm long, and G. raimondii produces no spinnable fiber (Figure 4). G. hirsutum fiber cells undergo fast elongation until ~30 days post anthesis (DPA), whereas those of G. arboreum stop growing around 20 DPA. Comparative analysis with T. cacao and A. thaliana revealed that the cotton genomes harbor a higher TE content and that more TEs are inserted near (within 1 kb of) cotton genes (Li et al., 2014). Although the protein-coding capacities of the At and Dt subgenomes are essentially maintained in the allotetraploid, the expression patterns of a large number of functional genes differ significantly from those in either diploid ancestor. These results indicate that TE-mediated gene regulation may modulate the unbalanced expression of homologous gene pairs in the allopolyploid. In the past several decades, many genes involved in regulating fiber elongation have been identified. Here we investigated TE distributions in the close vicinity of these gene loci. Figure 5A shows the genes that have a TE insertion within ~5 kb upstream of the coding sequence, together with their roles in cotton fiber development based on previous reports (Huang et al., 2013; Han et al., 2013; Li et al., 2005; Wang et al., 2009; Shi et al., 2006; Qin et al., 2007; Qin et al., 2005; Li et al., 2013; Xu et al., 2009; Shan et al., 2014; Walford et al., 2011; Wang et al., 2013; Luo et al., 2007; Pei, 2015).
We identified a copia-like retrotransposon insertion in the promoter region of a MYB-domain transcription factor gene only in the Dt subgenome, which is positively correlated with its higher expression in the allotetraploid (Figure 5B). Similarly, a LINE retrotransposon insertion in the promoter of an ethylene response factor (ERF) gene in the Dt subgenome is associated with the Dt expression bias of the homologous genes during ovule cell differentiation (Figure 5C). Both genes were previously known to be essential for cotton fiber and trichome development (Shi et al., 2006; Qin et al., 2007; Walford et al., 2011; Zhang et al., 2010).
Figure caption (fragment): ... (Fryxell, 1979; Dillehay et al., 2007; Wang et al., 2012).
Thus, TE insertions in genes associated with cotton development, and their polymorphism among the AA, DD, Dt and At genomes/subgenomes, may be key to understanding the vast variation in transcript levels and expression patterns among different cotton genomes. These data also suggest an indispensable role for TEs during the evolution and artificial selection of cotton fiber-associated traits.
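The upstream scan described in this section, which looks for TEs within a fixed window upstream of each fiber-related gene, can be sketched as a simple interval query. This is a minimal illustration under stated assumptions: coordinates are taken to be 1-based and inclusive, as in GFF-style annotation, and the `Feature` class, the function names and all example coordinates are hypothetical rather than taken from the cotton annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A gene or TE annotation with 1-based, inclusive coordinates."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str = "+"


def upstream_window(gene, size=5000):
    """Return (lo, hi) of the region `size` bp upstream of the gene, strand-aware."""
    if gene.strand == "+":
        return max(1, gene.start - size), gene.start - 1
    return gene.end + 1, gene.end + size


def tes_upstream(gene, tes, size=5000):
    """List the TEs on the gene's chromosome that overlap its upstream window."""
    lo, hi = upstream_window(gene, size)
    return [
        te for te in tes
        if te.chrom == gene.chrom and te.start <= hi and te.end >= lo
    ]


# Hypothetical homoeologous gene pair and TE annotations, for illustration only.
gene_dt = Feature("MYB_Dt", "Dt_chr", 10_000, 12_000, "+")
gene_at = Feature("MYB_At", "At_chr", 10_000, 12_000, "+")
gene_minus = Feature("ERF_Dt", "Dt_chr", 20_000, 22_000, "-")
tes = [
    Feature("copia_1", "Dt_chr", 8_500, 9_500),
    Feature("LINE_1", "Dt_chr", 23_000, 23_500),
    Feature("gypsy_1", "Dt_chr", 40_000, 45_000),
]

for gene in (gene_dt, gene_at, gene_minus):
    hits = [te.name for te in tes_upstream(gene, tes)]
    print(f"{gene.name}: {hits if hits else 'no TE within 5 kb upstream'}")
```

Running the same query on the At and Dt copies of a gene shows whether an insertion is specific to one subgenome, which is the pattern reported for the MYB and ERF genes above. The window size is a parameter, so the ~5 kb and 1 kb thresholds mentioned in the text can both be applied.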

CONCLUDING REMARKS
The assembled genomes of the two diploid ancestral cottons, Gossypium raimondii (D) and Gossypium arboreum (A), together with that of the cultivated allotetraploid Gossypium hirsutum (AD), provide references for genome re-sequencing in multiple Gossypium species. Owing to the recent burst of high-throughput sequencing technologies, identification of TE polymorphisms by genome re-sequencing has become a reality. Transcriptomic and epigenomic analyses, and especially functional studies such as gene editing, may be used to elucidate TE-mediated phenotypic changes in Gossypium. The genetic resources obtained from this research may enhance cotton production and improve fiber quality in the near future. We hope that the molecular mechanisms discovered in different Gossypium genomes will benefit studies of other plants, and even animals.