Genome-wide analysis of a recently active retrotransposon, Au SINE, in wheat: content, distribution within subgenomes and chromosomes, and gene associations

Key message Here, we show that Au SINE elements have strong associations with protein-coding genes in wheat. Most importantly Au SINE insertion within introns causes allelic variation and might induce intron retention. Abstract The impact of transposable elements (TEs) on genome structure and function is intensively studied in eukaryotes, especially in plants where TEs can reach up to 90% of the genome in some cases, such as in wheat. Here, we have performed a genome-wide in-silico analysis using the updated publicly available genome draft of bread wheat (T. aestivum), in addition to the updated genome drafts of the diploid donor species, T. urartu and Ae. tauschii, to retrieve and analyze a non-LTR retrotransposon family, termed Au SINE, which was found to be widespread in plant species. Then, we have performed site-specific PCR and realtime RT-PCR analyses to assess the possible impact of Au SINE on gene structure and function. To this end, we retrieved 133, 180 and 1886 intact Au SINE insertions from T. urartu, Ae. tauschii and T. aestivum genome drafts, respectively. The 1886 Au SINE insertions were distributed in the seven homoeologous chromosomes of T. aestivum, while ~ 67% of the insertions were associated with genes. Detailed analysis of 40 genes harboring Au SINE revealed allelic variation of those genes in the Triticum–Aegilops genus. In addition, expression analysis revealed that both regular transcripts and alternative Au SINE-containing transcripts were simultaneously amplified in the same tissue, indicating retention of Au SINE-containing introns. Analysis of the wheat transcriptome revealed that hundreds of protein-coding genes harbor Au SINE in at least one of their mature splice variants. Au SINE might play a prominent role in speciation by creating transcriptome variation. Electronic supplementary material The online version of this article (10.1007/s00299-017-2213-1) contains supplementary material, which is available to authorized users.


Introduction
Transposable elements (TEs) make up a large fraction of plant genomes (Kidwell 2002), as they can reach up to 90% of the wheat genome (Charles et al. 2008). Retrotransposons are the most abundant class of TEs in plants (Kejnovsky et al. 2012;Kumar and Bennetzen 1999); they are divided into LTR retrotransposons and non-LTR retrotransposons, the latter of which include Long INterspersed Nuclear Elements (LINEs) and Short INterspersed Nuclear Elements (SINEs). SINEs are miniature elements (80-500 bp) that probably originated from an accidental retroposition of polymerase III-derived (pol III, e.g. tRNAs) transcripts (Wicker et al. 2007). Their 5′ region harbors an internal pol III promoter [composed of A and B boxes that are recognized by RNA polymerase III (Arnaud et al. 2001)], a family unique internal region (sized 50-200 bp), and a 3′ region. Their 3′ region can be either AT or A rich and it contains short tandem repeats (3-5 bp) or a poly(A) tail. SINE superfamilies (tRNA, 7SL RNA and 5S RNA) are defined by conserved pol III promoters. SINEs are non-autonomous, as they are 1 3 only capable of transposition using proteins encoded by LINEs elements, while creating TSDs (5-15 bp) (Wicker et al. 2007). Several SINE families have been discovered in plants, such as in Brassica napus (Deragon et al. 1994) Oryza sativa (Hirano et al. 1994), Nicotiana tabacum (Yoshioka et al. 1993), Myotis daubentonii (Borodulina and Kramerov 1999) and others (Deragon and Zhang 2006;Wenke et al. 2011). While SINEs are less abundant in grasses compared to LTR retrotransposons (Kumar and Bennetzen 1999;Sabot et al. 2004), a SINE family termed Au SINE, discovered in high copy numbers in wheat (Ben-David et al. 2013;Yasui et al. 2001) was found to be widely distributed in higher plants (Fawcett et al. 2006;Yagi et al. 2011). The impact of SINEs on plant genomes is poorly studied, while it has been well studied in mammalians, e.g., MIR and Alu elements (Deininger and Batzer 1999;Lev-Maor et al. 2003;Makalowski 2003;Schmid 1998;Schmitz and Brosius 2011;Smit 1996Smit , 1999. Both MIR and Alu elements were found to play a role in the exonization process of protein-coding genes (Schmitz and Brosius 2011). In the exonization process, non-proteincoding sequences, primarily introns, become part of the mature RNA, creating alternative splice variants (Clavijo et al. 2017). It has been reported that fragments of Alu sequences, which exist in ~ 1.4 million copies in the human genome, may appear in the protein-coding region of mature RNAs (Makałowski et al. 1994;Nekrutenko and Li 2001). Nearly 1800 retrotransposon-derived exons were found in humans, mostly Alu-containing transcripts (Schmitz and Brosius 2011). However, the frequency of Alu-containing transcripts was found to be much lower than the alternatively spliced exons that do not contain an Alu sequence (Schmitz and Brosius 2011). In most cases, the insertion of Alu into the coding regions of mRNAs creates frame-shifts or premature termination codons, but sometimes it also creates new protein functions or had modified existing ones (Hilgard et al. 2002).
In a previous report, we showed that Au SINE retains retrotranspositional activity following allopolyploidization events in wheat (Ben-David et al. 2013). In this study, the availability of updated genome drafts for several wheat species, especially the genome draft and the RNA-seq database of bread wheat (T. aestivum) facilitated a genome-wide analysis of Au SINE in the wheat genome and transcriptome. We have retrieved Au SINE-containing sequences distributed among the seven homoeologous chromosomes of T. aestivum, and found strong association with hundreds of protein-coding genes; in most of our cases, Au SINE was found to be inserted within the introns of a gene. We then analyzed the impact of Au SINE on the structure of genes and found allelic variations of many genes, based on insertional polymorphism of Au SINE in various wheat species. Expression analysis of several genes by real-time RT-PCR, revealed that Au SINE might undergo exonization in T. aestivum. Genome-wide, in-silico analysis of the T. aestivum transcriptome revealed that tens of protein-coding genes harbor Au SINE in their coding sequence. Detailed analysis of 83 genes showed that at least 50 of them showed splice variants including or excluding an Au SINE. The possible impact of Au SINE on gene structure and function is discussed.

Results and discussion
Genome-wide analysis of Au SINE in genome drafts of T. urartu, Ae. tauschii and T. aestivum The publicly available sequence drafts for T. urartu, Ae. tauschii and T. aestivum facilitated a genome-wide analysis, including: copy numbers, insertion sites and distribution of Au SINE elements in bread wheat and its diploid ancestors. The relatively short sequence (181 bp) (Deragon and Zhang 2006;Yasui et al. 2001) of Au SINE allowed us to identify and characterize intact elements together with their insertion sites. In addition, the updated genome draft sequence of T. aestivum was published for each chromosome separately, which allowed the analysis of Au SINE content in each one of the three subgenomes (A, B and D), and analysis of the distribution of Au SINE in the seven homoeologous chromosomes. To this end, using the MITE analysis kit (MAK) (Yang and Hall 2003b), we have retrieved 133, 180 and 1886 intact Au SINE insertions from T. urartu, Ae. tauschii and T. aestivum genome drafts, respectively. The copy number of Au SINE in the allohexaploid T. aestivum genome was ~ tenfold its copy number in the diploid genomes, T. urartu and Ae. tauschii, indicating the massive retrotransposition burst of Au SINE following allopolyploidization events; most probably, the retrotransposition burst occurred following allotetraploidization because a similar content of Au SINE was found in the genome draft of Triticum turgidum ssp. dicoccoides (data not shown). This finding provides additional evidence for our previous report, wherein, using realtime quantitative PCR analysis, we found that the content of Au SINE was up to tenfold higher in allopolyploid wheat species compared to diploid species (Yaakov et al. 2013b).
Of the 1886 retrieved Au SINE insertions from the T. aestivum genome, 1849 were mapped to the seven homoeologous chromosomes (Fig. 1), distributed among the seven chromosomes of the three subgenomes: AA, BB and DD. The copy number of Au SINE in the AA subgenome was ~ fivefold higher than its copy number in the diploid AA genome (753 vs. 133, respectively), indicating proliferation of the element in the AA genome following allopolyploidization. In addition, the copy number of Au SINE in the DD subgenome was nearly similar to its copy number in the DD diploid genome (221 vs. 180, respectively), indicating a lack of proliferation in the DD subgenome following allohexaploidization, and thus the retrotransposition burst of Au SINE might occurred at the allotetraploid level, around 0.5 million years ago (Feldman and Levy 2005) Our data strongly indicate that although Au SINE is an ancient retrotransposon family (arising prior to the divergence of monocots and eudicots), found in many groups of higher plants (Fawcett et al. 2006;Yagi et al. 2011), it retained retrotranspositional activity in the Triticum-Aegilops genus.
Analysis of the common insertions of Au SINE among the three genome drafts: T. urartu, Ae. tauschii and T. aestivum revealed that only 24 of the 2199 total insertions are common (monomorphic insertions in the three species), indicating the massive proliferation of Au SINE after the divergence of the diploid species, around 4 million years ago (Feldman and Levy 2005). Note that the analysis was done based on 100% identity (e value = 0) of the insertions sites (Au SINE-flanking sequences), as such the results might be underestimated. In addition, 124 (53.1%) of the 221 insertions in the DD subgenome of T. aestivum are in common with Ae. tauschii, while only 77 (10.2%) of the 753 insertions in the AA subgenome of T. aestivum are in common with T. urartu (the diploid donor of genome AA). This finding supports our above conclusion that the retrotransposition burst of Au SINE might have occurred at the allotetraploidization level in the AA and BB subgenomes.
In a previous study (Ben-David et al. 2013), we reported on ~ 38% of the 3706 retrieved Au SINE insertions from the publicly available unassembled 454 pyrosequencing of T. aestivum were in transcribed regions. Here, we report on nearly half of the number of Au SINE insertions (1886) in the assembled and sorted genome draft of T. aestivum [(Clavijo et al. 2017) (http://www.ebi.ac.uk/ena/data/view/ GCA_900067645.1, plants.ensembl.org/Triticum_aestivum/)], indicating the sequence redundancy in the 454 pyrosequencing data (Brenchley et al. 2012), as we have noted in our previous study (Ben-David et al. 2013). Annotation of the 1886 Au SINE insertion sites revealed that 1268 (67.2%) insertions were located within or near (up to 500 bp upstream or downstream) the DNA sequence of predicted protein-coding genes, 213 (11.2%) insertions within non-coding RNA (ncRNA) sequences, 253 (13.4%) insertions within other class I (173 insertions) and class II (80 insertions) TEs, and the remaining 152 Au SINE insertions were in non-coding DNA sequences. The data demonstrate that ~ 78.5% of the Au SINE insertions are in transcribed sequences (excluding insertions in other TEs), which might indicate a strong association of Au SINE with genes. Proteincoding genes that harbor Au SINE include: Transcription factors, Zinc finger-containing proteins, Homeobox genes, Methyltransferase, RNA-directed DNA polymerase, DNAdamage-repair, Ethylene-forming-enzyme-like dioxygenase, chromatin-associated protein, WRKY transcription factor, and others (Table S1).

Allelic variation in protein-coding genes caused by Au SINE
To examine whether Au SINE insertions into protein-coding genes might cause allelic variation (based on presence/ absence of the element) among species in the Aegilops-Triticum genus, we have performed site-specific PCR analysis to amplify Au SINE elements within genes in Aegilops and Triticum species (see plant material), including diploid (AA, BB and DD genome species), tetraploid (wild emmer and durum, AABB genome), and hexaploid (AABBDD genome) species. Primers were designed from Au SINE-flanking sequences; so in each case, a larger PCR product represents a full site (presence of an Au SINE in the gene), while a smaller PCR product represents an empty site (absence of an Au SINE in the gene) (see Fig. 2). Note that in most cases PCR products were sequenced for validation of the presence/absence of Au SINE. To this end, 40 arbitrarily selected genes harboring Au SINE in T. aestivum were analyzed by site-specific PCR for presence/absence of the element in the genome of other wheat species (Table 1, supplementary Figure S1). Monomorphic Au SINE insertions in all the tested Aegilops and Triticum species were seen in 6 of the 40 analyzed genes (cases 1-6 in Table 1, see an example in Figure  S1a), indicating old insertions of Au SINE, most probably before the divergence of the Aegilops and Triticum species. For the remaining 34 genes, polymorphic Au SINE insertions in Aegilops and Triticum species were seen ( Fig. 2 and Figure S1). We have classified the insertion patterns into two is defined by its genome composition (AA, BB and DD subgenomes). A total of 1886 Au SINE insertions were retrieved from the T. aestivum genome draft, while 753, 875, and 221 insertions were retrieved from A, B and D subgenomes, respectively. Note that 37 insertions were not mapped in the seven homeologous chromosomes and they are indicated as "unknown" 1 3 main classes: (1) Au SINE has inserted into a gene only in the allopolyploid species (Fig. 2a-c), either in the T. aestivum only (cases 7-8 in Table 1) or in both T. turgidum and T. aestivum (cases 9-24 in Table 1), indicating that these insertions occurred following allopolyploidization; (2) Au SINE insertion occurred in the diploid species and was further inherited to the derived polyploid species, either from the DD genome (Ae. tauschii) donor (cases 25-29 in Table 1; Fig. 2d, e), the AA genome (T. urartu) donor (cases 30-33 in Table 1; Fig. 2f), the BB genome (Ae. speltoides and/or Ae. searsii) donors (cases 34-36 in Table 1, Figure S1w), or the insertion was seen in different diploid donors (cases 37-40 in Table 1, Fig. 2g, h). Note that in some cases polymorphic insertions were seen among different accessions of the same species, creating genetic variation within the same species; for example, insertions in some accessions of T. aestivum (Fig. 2a) or in some accessions of Ae. tauschii (Fig. 2d, e). Additionally, we have used accessions of Ae. speltoides and Ae. searsii because they are considered the potential donors of BB genome to wheat (Feldman and Levy 2005;. To this end, our data indicate the dynamic nature of Au SINE throughout wheat evolution and its strong association with genes that might impact gene structure and function.

Expression analysis of protein-coding genes harboring Au SINE
Sequence analysis of the 40 genes (Table 1) revealed that in most cases (35 of the 40) Au SINE had inserted in the intron region of the gene, indicating that Au SINE might be spliced out in the mature transcripts. In 4 cases (cases Fig. 2 Site-specific PCR analysis using primers from Au SINEflanking sequences. In each panel, the upper arrow represents a "full site" and the lower arrow represents an "empty site". "M" represents the size marker in all the gels, "NC" represents for negative control, ddH 2 0 was used as template in PCR reactions. The PCR analysis was performed in accessions of: BB 1 = Ae. searsii, BB 2 = Ae. speltoides, Note that for all polymorphic Au SINE insertions the difference between the "full site" fragment and the "empty site" is ~ 181 bp, the size of Au SINE. Numbers above each lane represent the genomic replicates. a Au SINE insertion in Putative Serine/threonine-protein kinase (case 8 in Table 1). The "full site" is 399 bp and the "empty site" is 218 bp. The insertion is unique to T. aestivum (amplified in two accessions 1 and 3). Note that the rest of the upper bands are non-specific PCR products as seen by sequence validation. b Au SINE insertion in an Predicted protein (case 22 in Table 1). The "full site" is 395 bp and the "empty site" is 214 bp. The insertion was seen in accessions of T. durum, T. dicoccoides and T. aestivum. c Au SINE insertion in Putative ATP-dependent RNA helicase DHX36 (cases 13 in Table 1). The "full site" is 387 bp and the "empty site" is 208 bp. The insertion was seen in T. durum, T. dicoccoides and T. aestivum. d Au SINE insertion in Inositol hexakisphosphate and diphosphoinositol-pentakisphosphate kinase (cases 27 in Table 1). The "full site" is 377 bp and the "empty site" is 196 bp. The insertion was seen in Ae. tauschii and T. aestivum. e Au SINE insertion in Chloride channel protein CLC-c (cases 25 in Table 1). The "full site" is 323 bp, and the "empty site" is140bp. The insertion was seen in Ae. tauschii and T. aestivum. f Au SINE insertion in an Predicted protein (case 31 in Table 1). The "full site" is 274 bp and the "empty site" is 97 bp. The insertion was seen in T. urartu, T. dicoccoides and T. aestivum. g Au SINE insertion in SIN3 transcription regulator family member B (cases 40 in Table 1). The "full site" is 342 bp and the "empty site" is 165. The insertion was seen in all tested species except for Ae. searsii accessions. h Au SINE insertion in Calcineurin-like metallo-phosphoesterase (cases 36 in Table 1). The "full site" is 352 bp and the "empty site" is 180 bp. The insertion was seen in Ae. speltoides, T. durum, T. dicoccoides and T. aestivum 3, 7, 18 and 33), the Au SINE was found to be part of an exon region, indicating that it might have underwent exonization throughout wheat evolution. In one case (case 4), the insertion was found 100 bp downstream to the gene. The phenomenon of SINE exonization has been reported in several studies in humans and other primates (Lev-Maor et al. 2003;Makałowski et al. 1994;Nekrutenko and Li 2001;Sorek et al. 2002), but has not been reported previously in plants.
Here, we have analyzed the expression of several genes harboring Au SINE insertions in T. aestivum, using realtime RT-PCR, and found that most of those genes are expressed in bread wheat. Note that we have used two sets of primers (Table S2) for the expression analysis; the first set was designed to amplify exon-exon junction rescripts, and the second set of primers was designed to amplify chimeric (Au-SINE/flanking) transcripts, if they exist (see schemes on top of each one of the four panels in Fig. 3). While for most genes the exon-exon junction transcripts were amplified, no chimeric transcripts were seen, indicating that the intron harboring Au SINE was spliced out in the mature RNA. However, the expression analysis of four genes revealed that both the regular transcript (based on exon-exon junction amplification) and the chimeric (Au SINE/flanking) were simultaneously amplified in the same tissue, indicating retention of Au SINE-containing intron (Fig. 3). Note that the purity of each cDNA sample was tested using site-specific PCR reaction with primers from two exons of Actin gene, giving different amplification products for cDNA and genomic DNA. No DNA contamination was detected ( Figure S2). In addition, the melting curves of the 4 cases are presented in Figure S3. The expression analysis of a gene (case 1 in Table 1) that codes for DnaJ homolog [considered as chaperones in eukaryotes (Westermann et al. 1996)], in two accessions of T. aestivum revealed that both the regular transcript (based on primers designed from exon 1-exon two junction), and the chimeric (Au SINE/flanking) were simultaneously amplified at different levels (Fig. 3a). The regular transcript level was ~ 22-fold higher compared to the chimeric transcript in one accession, and ~ 11-fold higher in the second accession. Note that the transcript levels were relative to ACTIN transcription and three biological replicates were used in each accession. The expression analysis of a gene (case 36 in Table 1) that codes for Calcineurinlike metallo-phosphoesterase [involved in phosphorylated proteins substrates, nucleic acids or phospholipids (KEP-PETIPOLA and SHUMAN 2006)] revealed that both the regular and the chimeric transcripts were simultaneously amplified, while the level of the regular transcript was over 160 fold higher compared to the chimeric transcript in both T. aestivum accessions (Fig. 3b). The Au SINE inserted into the intron located between exon 5 and exon 6 of this gene. The expression analysis of a gene (case 8, Table 1) that codes for Putative Serine/threonine-protein "NA (Not available)" represents monomorphic cases in which the original genome that displays Au  (Huala et al. 1997)] revealed that both the regular and the chimeric transcripts were simultaneously amplified, while the level of the regular transcript was over 250 fold higher compared to the chimeric transcript in both T. aestivum accessions (Fig. 3c). The Au SINE inserted into the intron located between exon 8 and exon 9 of this gene. Finally, the expression analysis of a gene (case 6 in Table 1) that codes for Superoxide dismutase [catalyzes the dismutation of the superoxide (Kliebenstein et al. 1998)] revealed that both the regular and the chimeric transcripts were simultaneously amplified, while the level of the regular transcript was over ~ 130 fold higher compared to the chimeric transcript in both T. aestivum accessions. In this case, the Au SINE had inserted into the intron located between exon 5 and exon 6 of this gene. The relatively very low expression of the Au SINE-containing transcripts might indicate that these alternative transcripts do not have a major impact on the normal function of the proteins, but they might lead to the creation of modified proteins with new functions, similarly as was reported in animal and human systems (Lev-Maor et al. 2003;Makałowski et al. 1994;Nekrutenko and Li 2001;Schmitz and Brosius 2011;Schwartz et al. 2009;Sorek et al. 2002).

Fig. 3
Relative expression levels of genes harboring Au SINE in two accessions of T. aestivum, as seen by realtime RT-PCR analysis. In each accession, the "regular transcript" compared to the Au SINEcontaining transcript were analyzed. On top of each panel, a scheme of the analyzed gene, while the numbered boxes the exons and the black boxes note the Au SINE insertion. White arrows note the primers used to amplify the "regular transcript", while the black arrows note the primers used to amplify the Au SINE-containing transcript. Expression levels (Y axis) were measured relative to ACTIN, and the exact relative expression (fold) is indicated by on top of each bar. Standard error on top of each bar was measured using three biological replicates. All the analyzed genes display the same trend of significantly higher expression levels of the regular transcript compared to the Au SINE-containing transcript, in the two tested T. aestivum accessions. The 4 analyzed genes: a DnaJ homolog (case 1 in Table 1), b Calcineurin-like metallo-phosphoesterase (case 36 in Table 1), c Putative Serine/threonine-protein kinase (case 8 in Table 1), and d Superoxide dismutase (case 6 in Table 1)

Genome-wide analysis of Au SINE-containing transcripts in T. aestivum
We have performed a genome-wide analysis of Au SINEcontaining transcripts from the updated RNA-seq database of bread wheat (plants.ensembl.org/Triticum_aestivum/Info/ Annotation/) to reveal exonization events of Au SINE. Using the MAK software, 113 Au SINE-containing transcripts (Au SINE and flanking sequences 500 bp upstream and downstream of the element) were retrieved from the T. aestivum transcriptome database. Detailed analysis of the 113 transcripts revealed that they belong to 83 protein-coding genes. Of the 83 genes that harbor Au SINE in their mature transcript, 76 were mapped in the seven homoeologous chromosomes, while 27, 47 and 2 were found in AA, BB and DD subgenomes, respectively (Fig. 4). Detailed analysis using the EnsemblPlants scripts revealed that 50 of the 83 genes showed different splice variants (Table 2), while many of those transcripts harbor Au SINE ( Table 2). The number of splice variants for the 50 genes ranged between 2 and 9 transcripts, while at least 1 splice variant was an Au SINE-containing transcript (Table 2). For example: a gene that codes for putative methyltransferase (Table 2) showed 5 splice variants in chromosome 1A of T. aestivum, while 3 variants harbor Au SINE within their transcript (Fig. 5a); a gene that codes for Zinc finger CCCH domain-containing protein (Table 2) showed 4 splice variants in chromosome 3B, while one variant contained Au SINE within its transcript (Fig. 5b); a Putative WRKY transcription factor (Table 2) showed 2 splice variants in chromosome 5B, one of them harbored Au SINE within its transcript (Fig. 5c); and a Transcription initiation factor (Table 2) showed three splice variants in chromosome 7B, one of them contained Au SINE within its transcript (Fig. 5d). Figure S4 shows genes whereas some of their variants contain an Au SINE insertion within an exon. For example, Figure S4a presents a gene coding for 3-deoxy-manno-octulosonate cytidylyltransferase, mitochondrial protein that two of its variants (1, 2) contain Au SINE insertion in their last exon. Detailed analysis of the coding regions (CDS) in each splice variant revealed that in most cases the CDS of Au SINE-containing transcripts is shorter than the regular transcript leading, if translated, to a shorter protein (Fig. 5). For example: the CDS of the three Au SINE-containing splice variants of putative methyltransferase (Fig. 5a) lead to predicted protein sizes of 433 aa, while the CDS of the regular transcript (Fig. 5a) leads to a predicted protein size of 600 aa; the size of the predicted protein of Zinc finger CCCH domaincontaining protein can reach up to 435 aa in the regular transcript, while it is 386 aa in the Au SINE-containing transcript (Fig. 5b); and the size of the predicted WRKY transcription factor is 495, while 355 aa in the Au SINEcontaining transcript (Fig. 5c). In some cases, Au SINE or part of it became part of the coding sequence. For example, in TRIAE_CS42_2BL_TGACv1_131783_AA0432150 gene (Table 2, Figure S4f), the coding sequence of variant 1 does not contain Au SINE, but the coding sequence of variant 2 starts in a start codon located within an Au SINE insertion. Another example is TRIAE_CS42_3B_TGACv1_224095_ AA0792250 gene (Table 2, Figure S4i) in which the coding sequence of variants 4 + 5 do not contain Au SINE, while the coding sequence of variants 1 + 2 + 3 + 6 contain different parts of the Au SINE insertion in their coding sequence. In variants 3 + 6, the coding sequence starts inside the Au SINE insertion. These data clearly indicate that Au SINE-containing introns underwent retention/exonizaion and became part of the mature transcript of many protein-coding genes.

Conclusions
An updated genome sequence draft for T. aestivum revealed that bread wheat consists of ~ 100,000 genes (Clavijo et al. 2017) and that over 80% of its genome consist of TEs. Our estimation based on the current study and our previous reports (Ben-David et al. 2013;Yaakov et al. 2013a; is that many wheat genes harbor at least one TE insertion, while most of the insertions are in intron regions. Plant TEs are considered one of the main components of the genome that are implicated in creating genetic variation among species. The insertion of TEs within genes might create allelic variation, and by such might impact gene expression. In this study, we provide data which led us to conclude that transposable elements, in this case a non-LTR retrotransposon termed Au SINE in wheat, might considerably impact gene structure and function by creating allelic variation and exonization in protein-coding genes. We have is defined by its genome composition (AA, BB and DD subgenomes). A total of 28, 47 and 2 Au SINE-containing genes were retrieved from AA, BB and DD subgenomes, respectively. Note that 7 Au SINE-containing genes were not mapped in the seven homoeologous chromosomes and they are indicated as "unknown"  used very stringent parameters in the MAK software to retrieve Au SINE insertions from the updated RNA-seq database of T. aestivum (Clavijo et al. 2017); thus, the number of Au SINE-containing transcripts that were retrieved here (113 transcripts) might be an underestimate. We estimate that the intron retention of Au SINE might occur in hundreds of wheat genes. To this end, transcriptional interference induced by intronic retrotransposons might impact the transcription of large number of genes. Alternative splicing generates transcriptome variation that could lead to subfunctionalization of genes and speciation. Finally, because Au SINE is found in the entire plant kingdom (Deragon and Zhang 2006;Fawcett et al. 2006), we hypothesize based on our data that intron retention of Au SINE might be a general phenomenon in plants.

Genomic data
In this study, three publicly available genome drafts were analyzed: (1) T. urartu, the donor of AA genome that was paired-end sequenced using whole-genome shotgun by Illumina [plants.ensembl.org/Aegilops_tauschii/Info/ Index, (Ling et al. 2013)].
(2) Ae. tauschii, the donor of DD genome that was sequenced and assembled in the same way as T. urartu and the assembled scaffolds cover 83.4% of its genome with 90-fold depth reads.

Transcriptomic data
Here, we used the updated publicly available RNA-seq database of T. aestivum found in Ensemblplants [ (Clavijo et al. 2017), plants.ensembl.org/info/website/ftp/index.html]. The library includes cDNA, CDS and ncRNA sequences that were used for annotation analysis in our study.

Retrieval of Au SINE insertions
The sequences of Au SINE were retrieved from these genome drafts and transcriptome, using the MITE analysis kit (MAK) software [a standalone version was kindly provided by Guojun Yang, University of Toronto, (Janicki et al. 2011;Yang and Hall 2003a)]. The publicly available consensus sequence of the Au SINE family (GIRI database at http://www.girinst.org/repbase/update/browse.php) was used as an input (query sequence) in the MAK software and BLASTN was performed against the genomic drafts. For the retrieval of Au SINE-containing sequences from the genome drafts, we have used the MAK function "Member", an e value of 10 −3 and an end mismatch tolerance of 20 nucleotides. In addition, flanking sequences (500 bp from each end) were retrieved together with each one of the insertions, to characterize the insertion sites. A rice-specific MITE, called mPing, was used as a negative control in this analysis and no mPing-related sequences were retrieved in wheat. Redundant sequences were detected by BLAST + software (Camacho et al. 2009) using BLASTN function. We have compared sequences against themselves and excluded the paired element from each couple of sequences that were found to have a 100% identity (100% coverage with an e value of 0 and no gaps). The final output files were then edited using Textpad 7.4 'Regular Expression' functions for cleaning excess data. It is important to mention that we have considered in this analysis truncated elements (at one of the terminal sequences) as being nearly intact elements.

Insertion sites annotation
Annotation of Au SINE-flanking sequences was performed using the complementary-DNA (cDNA), coding sequences (CDS) and non-coding RNA (ncRNA) databases of T. aestivum (taken from EnsemblPlants at plants.ensembl.org/ index.html). In addition, Transposable element consensus sequences from different plant genomes were also used as database in this annotation analysis (taken from ITMI at bot-serv2.uzh.ch/kelldata/trep-db/index.html). Annotation was performed using BLAST + standalone version 2.2.3 with an e value of 10 −10 . The merged 5′ and 3′ flanking sequences were used as query against the mentioned databases.

Plant material, DNA and RNA extraction
In this study, we have used 21 accessions of seven Triticum and Aegilops species including the possible donors of AA (T. urartu, three accessions), BB (Ae. speltoides, and Ae. searsii, three accessions from each species), DD (Ae. tauschii, three accessions) genomes, and the allopolyploid species, T. turgidum (wild emmer and durum wheat, three accessions from each species) and T. aestivum (bread wheat, three accessions) seeds were kindly provided by Moshe Feldman, the Weizmann Institute of Science, Israel and the US Department of Agriculture (npgsweb.ars-grin.gov/gringlobal/ search.aspx). Young leaves of ~ 4 weeks post germination plants were used for DNA (using GeneJET plant genomic DNA Purification Mini Kit, Thermo scientific) and RNA (using TRI reagent, Sigma) extractions. First strand cDNA was created using 5X All-In-One RT MasterMix (Applied Biological Material).

Site-Specific PCR analysis
Insertional polymorphism of Au SINE was analyzed based on primers designed from flanking sequences (both sides) of Au SINE insertion. Primers were designed using PRIMER3 version 4.0.0 (bioinfo.ut.ee/primer3/) (see Table 1

Expression analysis
Real-time quantitative RT-PCR (using 7500 Fast Real-Time PCR system, Applied Biosystems) was used to analyze the expression of genes harboring Au SINE in leaves of bread wheat (T. aestivum). For each gene, two primer pairs were designed; the first to amplify a regular transcript based on exon-exon junction amplification, and the second primer pair designed to amplify a chimeric transcript, if produced, consisting of Au SINE and flanking intron sequence. Primers were designed using the Primer Express v2.0 software and the PRIMER3 version 4.0.0 software (bioinfo.ut.ee/ primer3/). Each reaction contained: 7.5 µl KAPA SYBR FAST qPCR Master Mix, 0.3 µl ROX Low 509-a reference dye for fluorescence normalization (KAPA BIOSYS-TEMS), 1 µl forward primer (10 µM), 1 µl reverse primer (10 µM), 0.2 µl H 2 O (nuclease free water, Hylabs) and 5 µl or of cDNA template (50X dilution). The data were analyzed using the 7500 version 2.0.5 software (Applied Biosystems). The reaction conditions were 20 s at 95 °C, followed by 40 cycles of 3 s at 95 °C and 30 s at 60 °C. To differentiate specific PCR products from nonspecific ones, a melting curve was generated right after amplification by employing a 15 s incubation at 95 °C and a 1 min incubation at 60 °C, after which the temperature was raised by increments of 0.1 °C per sec until reaching 95 °C. Data of each sample were received as Ct, threshold cycle of the PCR amplification reaching a certain level of fluorescence (Livak and Schmittgen 2001), normalized to the Ct of ACTIN, a known single copy gene used as an endogenous control. A comparative 2 −ΔΔC t method was then used to determine the relative expression level of the two targets in each sample. First, each one of the normalized target expression levels in each sample was compared to the normalized target expression level of the reference sample, based on the following equation: Therefore, RQ (the relative expression level) = (2 × primer efficiency) −ΔΔC t . Second, the two targets in each sample were compared to find their relative expression levels. Three technical replicates were used for each reaction to evaluate reproducibility. Standard deviations (SD) were calculated based on these three replications. Note that total RNA (not treated with reverse transcriptase) was used in RT-PCR reaction as a negative control for DNA contamination.

In-silico analysis of Au SINE-containing transcripts
Au SINE-containing transcripts were further examined for all predicted variants of the same gene, as found in Ensemblplants. To validate these sequences are real transcripts and not genomic DNA sequences (due to contamination of the transcriptome), we compared these transcripts with the T. aestivum genome using BLASTN analysis to check whether we find full hits (meaning, genomic DNA sequence), or multiple partial hits for each sequence (meaning, mature RNA transcript). This was done by BLASTN algorithm with comparison of transcripts (query) to genomic database with an e value < 1e −100 (Table S3). This analysis showed whether each transcript had a full match (100% coverage) to sequence in the genomic database, or multiple partial matches to sequences in different locations of the genomic database. Using this analysis, we can eliminate transcripts that had full match and suspected to be genomic DNA or precursor RNA. To this end, all transcripts used here are mature RNA transcripts. The translated region of each transcript was determined by the CDS (coding sequence) as found in Ensemblplants. Each transcript containing an Au SINE insertion was traced back to its gene by transcript accession. All predicted variants of the same gene were examined in BLASTN analysis vs. Au SINE sequence and the specific location of insertion was determined in each variant.
Author contribution statement DK: Generated the insilico analysis data, analyzed results and manuscript preparation. CD: Generated the PCR and RT-PCR data, analyzed results. KK: (corresponding author). analyzed results, manuscript preparation and submission.