Background

Short interspersed nuclear elements (SINEs) are Class I transposable elements (TEs) that propagate by a copy-and-paste mechanism [1, 2]. SINEs are evolutionarily derived from endogenous RNA polymerase III (Pol III) transcripts [3]. While mammalian SINEs such as B1 and Alu originated from 7SL RNAs, other eukaryotes primarily harbor tRNA-like SINEs [4], and SINEs originating from 5S rRNA have been found in zebrafish, fruit bats, and springhare [5, 6]. Recently, SINEs derived from small nuclear RNA (snRNA) (SINEU) and from the 3′-end of the large ribosomal subunit (LSU or 28S rDNA) (SINE28) have been identified in avian, crocodilian and mammalian genomes [7,8,9]. The characteristic features of SINEs include a 5′-terminal RNA-related region that contains an internal Pol III promoter, a central region, and a 3′-tail of variable length that is recognized by the reverse transcriptase (RT) of an autonomous partner long interspersed nuclear element (LINE) during retrotransposition [3]. SINE promoters derived from tRNA and 7SL RNA comprise box A and box B motifs, whereas 5S rRNA-derived SINE promoters have three boxes: A, IE and C [10].

As non-autonomous retrotransposons, SINEs depend on partner LINE activity for their replication rate and survival, and their genomic copy number varies greatly between families and host species. For example, as many as 1 million copies of Alu elements have been identified in the human genome [11], whereas only two copies of ZmSINE3 were detected in Zea mays [12]. The number of SINE families within a genome is also highly variable, ranging from a single SINE family in the Vitaceae to 22 SINE families in the Amaranthaceae [13]. Interestingly, unlike LINEs, the distribution of a SINE family is generally restricted to a certain taxonomic group such as an order or family [3, 4, 14], suggesting that SINEs are one of the major genetic elements that determine clade-specific genomic composition.

Transposable elements play an important role in the epigenetic regulation of the genome and generation of genomic novelty. A growing body of evidence has recently accumulated indicating that SINEs have a deep impact on genome organization and gene structure by generating regulatory elements for gene expression [15, 16], exon skipping and alternative splicing [17], alternative polyadenylation signals [18, 19], and even functional RNA genes [20, 21]. For example, an Alu SINE inserted into human pluripotency-associated transcript 5 (HPAT5) regulated related microRNAs through its let-7 binding site, which is essential for inner cell mass formation during early embryonic development [22].

While SINEs have been well characterized in human [23], other mammals [24] and plants [25], and about 200 SINE families/subfamilies are currently identified in various clades of Metazoa, as reported in Repbase [26] and SINEBase [2], information on insect SINEs is still limited [27,28,29,30,31,32,33]. Recent improvements in both genome sequencing and assembly methodologies have led to an increasing number of high-quality insect genome assemblies, which provide the opportunity to identify novel SINEs. However, because of their minimal sequence features, lack of coding capacity, and high sequence heterogeneity, SINE annotations are often incomplete or missing. Here, we describe three tRNA-derived SINE families and two 5S rRNA-derived SINE families in the diamondback moth (DBM), Plutella xylostella (L.), one of the most damaging insect pests of cruciferous vegetables around the world. We investigated the structures and insertion regions of these SINEs and surveyed their distribution in other lepidopteran insect species.

Results

Novel tRNA-derived SINE retrotransposons, PxSE1, PxSE2 and PxSE3, in P. xylostella

A novel tRNA-derived SINE, PxSE1, was identified by homology search in the DBM genome database (Additional file 1: Figure S1). A total of 68 full-length copies homologous to PxSE1 were used to reconstruct the consensus sequence of PxSE1 (Accession numbers: MW068006-MW068073). PxSE1 is 263 bp long and includes GT dinucleotide repeats at the 3′-tail and a 72-bp tRNA-related region at the 5′-end with 64% identity to the 72-bp tRNAArg of Drosophila melanogaster, which contains box A and box B of the RNA Pol III promoter (Figs. 1 and 2a, Additional file 2: Figure S2). The boundary of PxSE1 was further defined by the alignment of a PxSE1 element and its empty site sequence (Additional file 3: Figure S3). Using PxSE1 as the query, a total of 6208 copies were identified in the DBM genome (Table 1). The average divergence across all PxSE1 copies is 0.035 (Table 1), indicating a recent invasion.

Fig. 1

The schematic representation of the structure of PxSE1, PxSE2, PxSE3, PxSE4 and PxSE5 in P. xylostella. The A, B, IE and C in the tRNAArg or 5S rRNA region represent the A box, B box, intermediate element and C box, respectively. PxSE1, PxSE2 and PxSE3 are tRNA-derived SINEs; PxSE4 and PxSE5 are 5S rRNA-derived SINEs

Fig. 2

The consensus sequences of PxSE1, PxSE2, PxSE3, PxSE4 and PxSE5. BmSEm, SINE2-1_PXu and HaSE3 sequences were obtained from the Repbase database. The tRNA and 5S rRNA sequences are the D. melanogaster tRNAArg sequence (Accession number: V00243) and the B. mori 5S rRNA sequence (Accession number: K03316), respectively. a PxSE1 and PxSE2 consensus sequences aligned with the tRNA sequence and BmSE. Nucleotides shaded in black are conserved across sequences. The underlined A box and B box sequences are the RNA Pol III promoter sequences. b PxSE3 consensus sequence aligned with the tRNA-related region and conserved central domain of SINE2-1_PXu. c PxSE4 and PxSE5 consensus sequences aligned with 5S rRNA and the 3′-region of PxLINE1.1. PxLINE1.1 is a new LINE transposon in P. xylostella

Table 1 Novel SINE elements identified in this study

Using PxSE1 as the query, two additional tRNA-derived SINEs, PxSE2 and PxSE3, were identified by database searches. The consensus sequences of PxSE2 and PxSE3 were reconstructed using the same methods as described above (Accession numbers: MW068074-MW068156, Additional file 2: Figure S2 and Additional file 3: Figure S3). PxSE2 is 263 bp long and includes a 143 bp 3′-end sequence that differs from that of PxSE1 but has 67.5% identity with BmSE. The 72-bp tRNA-related region of PxSE2 is 93.4% identical to that of PxSE1 (Figs. 1 and 2a). Interestingly, PxSE2 has a 44 bp conserved central domain with 93.2% identity to PxSE1 (Fig. 2a). PxSE3 is 339 bp long, includes a 72-bp tRNA-related region with 66.5% identity to tRNAArg of D. melanogaster, and has 79.3% identity with the 222 bp sequence at the 5′-end of SINE2-1_Pxu from Papilio xuthus [34] (Figs. 1 and 2b). The copy numbers of PxSE2 and PxSE3 in the DBM genome were 5056 and 5158, respectively (Table 1). The average divergences of PxSE2 and PxSE3 were 0.071 and 0.089, respectively (Table 1).

Distribution of PxSE1, PxSE2 and PxSE3 in other species

BLAST searches were performed to detect PxSE1, PxSE2 and PxSE3 sequences in insect species other than P. xylostella. In total, homologous sequences of PxSE1, PxSE2 and PxSE3 were identified in five, two and seven lepidopteran insects, respectively (Accession numbers: MW068230-MW069451, Additional file 2: Figure S2), among which MsSE2 in Manduca sexta showed the highest copy number (16,157), whereas only 533 copies of CsSE1 were detected in the genome of Chilo suppressalis (Table 1). The consensus sequences of these elements vary in size from 252 bp to 333 bp and have different 3′-tails. In contrast to the other elements, the consensus sequence of EpSE1 did not contain poly(A), poly(T) or simple sequence repeats at the 3′-end. The average divergence varied from 0.035 to 0.13 (Table 1). Although PxSE2- and PxSE3-like elements were not identified in non-insect species, a PxSE1-like element, SlNPVSE1, was detected in Spodoptera litura nucleopolyhedrovirus II (EU780426.1: 30485–30735), located within ORF27, which encodes an unknown protein.

Multiple sequence alignment of the consensus sequences showed that the evolutionary divergence varied from 0.003 to 0.436. The highest identity (99.7%) was observed between PmSE1 in Papilio machaon and PzSE1 in Papilio zelicaon, whereas MsSE1 in M. sexta and CfSE1 in Choristoneura fumiferana showed the highest evolutionary divergence (0.436) (Additional file 4: Figure S4).

Two 5S rRNA-derived SINEs, PxSE4 and PxSE5, in P. xylostella and related species

Using HaSE3 as a query [33], BLAST searches revealed two 5S rRNA-derived SINEs, PxSE4 and PxSE5, in DBM (Accession numbers: MW068157-MW068229, Figs. 1 and 2c). The boundaries of PxSE4 and PxSE5 were further defined by the alignment of a single PxSE element and its empty site sequence (Additional file 2: Figure S2 and Figure S3). PxSE4 and PxSE5 are both 389 bp in length and share high identity over a 250 bp sequence at the 5′-end but differ at the 3′-end. The promoter regions of PxSE4 and PxSE5 include the specific A box, IE and C box, and share about 63% identity with the 5S rRNA of Bombyx mori, indicating that they are 5S rRNA-derived SINEs (Fig. 2c). The copy numbers of PxSE4 and PxSE5 were 4415 and 1952, and their average divergences were 0.078 and 0.132, respectively (Table 1).

Interestingly, we found a LINE element, PxLINE1.1 (NW_011952036.1: 552486–555713), whose 43-bp 3′-end is 84% identical to that of PxSE5 (Fig. 2). Thus, this region was designated as the 3′-LINE-related region (Fig. 1). The PxLINE1.1 element is 3228 bp long, is flanked by 13 bp target site duplications (TSDs), encodes an L1_EN domain (endonuclease domain of the non-LTR retrotransposon LINE-1) and an RT domain, and terminates in ATGT tetranucleotide repeats in the short 3′ untranslated region (3′ UTR) (Fig. 3). Eight additional copies that are 96.1 to 99.7% identical to PxLINE1.1 were found in P. xylostella. Specifically, one copy (AHIO01028576.1:13049_14357) from WGS was inserted as a 1686 bp fragment, which shared 71.8% identity with mariner-8_BM from B. mori [35] (Table 2 and Additional file 5: Figure S5). Sequences sharing 63 to 82% identity with the 1580 bp fragment at the 3′-end of PxLINE1.1 were also found in the genomes of 7 other lepidopteran insects (Additional file 6: Figure S6).

Fig. 3

The nucleotide sequence and conceptual translation of the partner LINE element, PxLINE1.1, for PxSE5. Flanking direct repeats are indicated in lowercase. The nucleotides of TSD are indicated with the wavy line. The nucleotides of 3′ tail sequence are indicated with the straight line

Table 2 Copies with high identity to PxLINE1.1 in P. xylostella

The PxSE4 and PxSE5 sequences were used as queries to search against the whole genome shotgun (WGS) and expressed sequence tag (EST) databases using BLASTN. Three elements, LaSE2, CsSE2 and ObSE2, with high identity to PxSE4 were found in the genomes of Lerema accius, C. suppressalis and Operophtera brumata, respectively (Accession numbers: MW068230-MW069451, Additional file 2: Figure S2). In particular, the 115-bp fragment at the 5′-end of ObSE2 differs from PxSE4, whereas the central 122-bp fragment shares high identity with PxSE4 (Additional file 4: Figure S4G and Additional file 7: Figure S7B). The 75-bp fragment at the 5′-end of ObSE2 is 54.2% identical to the 72-bp tRNA of D. melanogaster, but differs from PxSE1 (Additional file 7: Figure S7A). However, no simple repeat sequences were found at the 3′-end of ObSE2. While we did not find PxSE5-like elements in other insects, the 56-bp fragments at the 3′-ends of PxSE5 and SfSE1 shared 89.6% identity (Additional file 4: Figure S4I).

Transpositional burst of SINEs

Due to the accumulation of random mutations over time, evolutionarily ancient SINE families have a lower sequence identity among copies, whereas SINE families with recent or ongoing transposition harbor relatively homogeneous copies [12]. To evaluate the periods of transpositional activity and the relative age of the copies in each SINE family, we performed a pairwise comparison of SINE copies with the consensus sequence of the respective family and grouped them into intervals from 80 to 100% identity. As shown in Fig. 4, 4796 of 6208 copies of PxSE1 show more than 95% identity to the consensus sequence, of which 223 copies are 100% identical to the PxSE1 consensus sequence (Additional file 8: Table S1), indicating a recent transpositional burst. A strong transposition peak with high identity values is also observed in PxSE2, PxSE3, and PxSE4. However, PxSE5 shows high numbers of diverged copies, and only 49 copies (2.5%) of PxSE5 have more than 95% identity with its consensus sequence (Fig. 4).
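
The identity binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: it assumes the percent identity of each copy to its family consensus has already been computed, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter


def identity_histogram(identities, low=80, high=100):
    """Count copies per integer percent-identity bin from low to high.

    Each identity is rounded half-up to the nearest integer, mirroring
    the integer rounding mentioned in the Methods. Values that fall
    outside [low, high] after rounding are ignored.
    """
    counts = Counter()
    for value in identities:
        rounded = math.floor(value + 0.5)
        if low <= rounded <= high:
            counts[rounded] += 1
    # Return every bin, including empty ones, so profiles are comparable.
    return {b: counts.get(b, 0) for b in range(low, high + 1)}


def fraction_above(identities, threshold=95.0):
    """Fraction of copies with identity strictly above the threshold."""
    if not identities:
        raise ValueError("no identity values supplied")
    return sum(1 for v in identities if v > threshold) / len(identities)


# Hypothetical identities (percent) of six copies to their consensus.
example = [99.6, 100.0, 97.2, 85.4, 79.9, 95.0]
histogram = identity_histogram(example)
young_fraction = fraction_above(example)
```

With the counts reported above, 4796 of 6208 PxSE1 copies corresponds to roughly 77% of copies above 95% identity, compared with 2.5% for PxSE5.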

Fig. 4

Examples of the relative age distribution of SINE families in P. xylostella, M. sexta, C. suppressalis, L. accius and O. brumata. The abscissa shows the identity between each copy and its consensus sequence. The ordinate shows the number of copies with the same identity. The same color represents the same SINE family

The activity profiles deduced from similarity intervals of SINEs in other lepidopteran species revealed a recent transpositional burst of CsSE1 and CsSE2 in C. suppressalis and ObSE2 in O. brumata, whereas LaSE1 and SfSE1 harbour diverged copies and only a few young ones (Fig. 4 and Additional file 9: Figure S8). High numbers of copies with a wide range of identity values were observed in MsSE1, MsSE2, PgSE1, PmSE1 and LaSE2 (Fig. 4 and Additional file 9: Figure S8). Because few copies were found in the related EST and transcriptome shotgun assembly (TSA) databases, the distribution profiles of copy identity in SlituSE1, SlittSE1, CfSE1, PzSE1, EpSE1 and SeSE1 were not analyzed.

Contribution of SINEs to gene and genome evolution in P. xylostella

The integration pattern relative to the annotated genes in the genome of P. xylostella was analyzed. A total of 2750 out of 6208 copies (44%) of PxSE1, 2478 out of 5056 copies (49%) of PxSE2, 2470 out of 5158 copies (48%) of PxSE3, 2265 out of 4415 copies (51%) of PxSE4 and 902 out of 1952 copies (46%) of PxSE5 were found in introns (Fig. 5a). Similar proportions of the copies are distributed in regions 5 kbp downstream of genes. Only two, five, five, eight and five copies of PxSE1, PxSE2, PxSE3, PxSE4 and PxSE5, respectively, were found inserted into exonic regions (Fig. 5a). Among them, 11 copies are inserted into coding regions (CDS), one copy into the 5′ UTR, and 13 copies into the 3′ UTR (Table 3). Most of these genes were annotated as enzymes or enzyme-associated proteins and were related to signal transduction, splicing and metabolism. For example, a 261 bp copy of the PxSE2 family, PxSE2.2 (NW_011952011.1: 2273356–2273095), is inserted into the CDS of a gene encoding a nitrogen permease regulator 3-like protein. The 21-bp fragment at the 5′-end of PxSE2.2 contributes 7 amino acids to the N-terminus of the protein (Fig. 5b).

Fig. 5

Gene association of SINEs in P. xylostella. a Overall proportions of SINEs in the genome of P. xylostella, represented as pie charts. b Integration of a PxSE2 element within the CDS of a gene encoding a nitrogen permease regulator 3-like protein. The sequences highlighted in yellow represent the exon region of LOC105380419; the sequence in lowercase is the PxSE2.2 copy of PxSE2

Table 3 The annotation of SINEs copies integrated into CDS and untranslated regions (UTR) in P. xylostella

Further analysis revealed the insertion of multiple copies of SINE families into introns of the same gene. As many as 60 elements were inserted into introns of the LOC105382892 gene, including 18, 14, 10, 11 and 7 copies of PxSE1, PxSE2, PxSE3, PxSE4 and PxSE5, respectively (Additional file 10: Figure S9). A total of 95 genes were found to contain at least ten SINE copies (Additional file 10: Figure S9D). Thus, the P. xylostella SINE families contribute to structural variation in introns, which might influence the regulation of gene expression.

Evolution and horizontal transposon transfer (HTT) of SINEs

The phylogenetic tree of the 23 SINE consensus sequences showed that SINEs with the same internal Pol III promoter clustered together, with the exception of ObSE2 (Fig. 6a). Given the high identity of PxSE1 and PxSE2 at their 5′-ends, the clustering of related SINEs from different families, such as PxSE1, PxSE2, CfSE1, ObSE1 and CsSE1, is not surprising. The comparison of the phylogenetic tree of the PxSE3 family with the taxonomy tree of the related host species [36, 37] (Fig. 6) suggests some degree of vertical transmission of the PxSE3 family in lepidopteran insects. Interestingly, SlNPVSE1, SfSE1 in Spodoptera frugiperda, SlittSE1 in Spodoptera littoralis and SlituSE1 in S. litura clustered together (Table 1 and Fig. 6a). Sequences orthologous to the outer flanking sequence of SlNPVSE1 were identified in Spodoptera eridania nucleopolyhedrovirus isolate 251 and Spodoptera cosmioides nucleopolyhedrovirus isolate VPN72, suggesting that SlNPVSE1 was inserted into the nucleopolyhedrovirus genome by HTT. In addition, the inner 5′-flanking sequence (about 800 bp) was found to share 95% identity with a sequence (WNNL01000005.1: 248783–248238) of the Spodoptera exigua genome (Additional file 11: Figure S10 and Fig. 7), putatively resulting from an unknown horizontal gene transfer.

Fig. 6

The evolutionary tree of 23 novel SINEs in this study (a) and the taxonomy tree of lepidopteran insects harboring PxSE3-like SINEs (b)

Fig. 7

The evidence of HTT from Lepidoptera to baculovirus. Multiple sequence alignment of SlNPVSE1, its flanking sequences and the orthologous sequences. Se-WH-S is a host sequence from the S. exigua genome (WNNL01000005.1:248783–248238); SlNPV-II is the baculovirus sequence from S. litura nucleopolyhedrovirus II (Accession number: EU780426.1:29774–31088) containing the SINE copy; SeNPV-251 and ScNPV-vpn72 are orthologous sequences of SlNPV-II from S. eridania nucleopolyhedrovirus isolate 251 (Accession number: MH320559.1:31479–31679) and S. cosmioides nucleopolyhedrovirus isolate VPN72 (Accession number: MK419955.1:32601–32796), respectively

Discussion

The structure of three tRNA-derived and two 5S rRNA-derived SINE families

Up to now, more than 234 SINEs have been isolated from the genomes of human, other mammals, reptiles, fishes, mollusks, fungi, green plants, and insects [2]. Based on current data, tRNA-derived SINEs (~ 84%) are widespread in eukaryotic genomes [2]. Apart from the 5′-terminal head, SINEs also consist of a typical body and a variable repeated tail. In this study, we have identified three tRNA-derived SINE families, PxSE1, PxSE2 and PxSE3. In addition to their highly identical heads, PxSE1 and PxSE2 also showed high identity (93.3%) over a 45 bp stretch of the body region. Similarly, two 5S rRNA-derived SINEs, PxSE4 and PxSE5, shared 98.7% identity over a 159 bp region of their bodies. Previous studies have found that the conserved bodies of SINEs mainly include the V-domain, CORE-domain, Deu-domain, Nin-domain, Ceph-domain, Inv-domain, Pln-domain, Snail-domain, and Meta-domain [38,39,40,41,42,43]. However, the body regions identified in PxSEs are different from these known domains. It has been hypothesized that nonautonomous LINEs that retain only the 5′ and 3′ regions of the original LINEs can be a source of the enigmatic middle body of SINEs [1]. Hence, the highly identical conserved central domains among different SINEs in the same species suggest that the conserved central domain may have originated from the same LINE family and has been under strong selective constraint, which is important for reverse transcription. In addition, despite the high identity between ObSE2 and the 5S rRNA-derived PxSE4, ObSE2 is a tRNA-derived SINE.

Partner LINE

SINEs can be composed of the 5′ and 3′ regions of nonautonomous LINEs, and their 3′ tails can also be exchanged with those of other LINEs under the pressure of natural selection to facilitate rapid amplification [1]. The LINE-homologous tail is important for a SINE because it allows new SINE copies to integrate into new genomic locations using the LINE RT [44]. The LINE RT can specifically recognize the homologous 3′ tail of a SINE, indicating that SINEs can be mobilized by the retrotransposition machinery of a partner LINE [45]. Here, nine novel LINE copies in P. xylostella and LINEs in seven other lepidopteran insects were identified with 3′-ends similar to those of PxSE5 and SfSE1, suggesting that the LINE identified in this study is an ancient retrotransposon and might be widespread in lepidopteran insects. However, the 5′ regions of SfSE1, PxSE5 and PxLINE1 are highly divergent, indicating that these SINEs amplified after the exchange of 3′-tails. Moreover, the distinct 3′-ends of the other PxSEs suggest that these SINEs might be mobilized by other LINEs that have not yet been identified.

Relative age and distribution of SINEs in Lepidoptera insect

The copy number of SINEs varies among families and species. In P. xylostella, the copy numbers of SINEs of tRNA origin are relatively higher than those of 5S rRNA origin. In particular, the copy number of PxSE5 is only 1952. It was previously speculated that the type 1 promoter in 5S rRNAs is more dependent on upstream signals than the type 2 promoter in tRNAs, so that the Pol III promoter in a retroposed 5S rRNA copy presumably remains silent or is expressed at a low level [5]. The copy number of SINEs of the same origin also differs between species. The copy numbers of MsSE1 and MsSE2 in M. sexta and SfSE1 in S. frugiperda were 7513, 16,157 and 11,117, respectively, whereas only 4521 copies of ObSE1 and 863 copies of ObSE2 were found in O. brumata. The genome sizes of M. sexta and S. frugiperda are around 400 Mb, while O. brumata has a larger genome of 618 Mb. Hence, SINE copy number may not correlate with genome size. Some features of the 3′-tail, such as the length of the poly(A) tail or short direct repeats, sequence conservation and distance to the transcriptional terminator, may affect the retroposition efficiency of SINE families [46, 47]. In this study, the varied 3′-tails of these SINEs in different species may have affected their distribution in the genome. However, their relationship with copy number cannot be determined at this time.

Based on the divergence of the copies from the consensus sequence, the relative age distribution of the identified SINEs was analyzed. Scattered age profiles were found for most SINEs among all species or within a genus, suggesting that the activity and accumulation of these SINEs are dynamic processes that can vary considerably between host lineages and SINE lineages. In particular, the high and narrowly concentrated identity values of PxSE1 copies show that it is most likely a relatively young retrotransposon in the genome of P. xylostella, generated by recent explosive amplification. The scattered distribution of PxSE5 copies, in turn, suggests that it is older than the other SINEs.

SINEs contribute to DBM genome evolution

The ability of TEs to replicate and move in the genome affects the genomic structure, gene expression, and the divergence and evolution of host species [48,49,50,51]. The genome size of DBM is 343.575 Mb, of which introns occupy 35.23% (121.039 Mb) [52]. The integration pattern analysis revealed that 44–51% of PxSINE copies were inserted into introns and only 2–8 copies per family were inserted into exons, indicating that PxSINEs preferentially insert into or accumulate in introns of genic regions. However, the proportions of different SINEs located within introns of Solanaceae range from 15 to 54% [53], and 96% of SINEs inside genes were located inside introns in Zoysia japonica and maize [54], suggesting that the distribution characteristics of SINEs vary between species. Introns have long been an exemplar of regulated splicing, which affects and enhances almost every step of mRNA metabolism by the act of their removal [55]. In mice, a recent insertion of an MT-C retrotransposon into a DICER intron truncated its first 6 exons, providing an alternative promoter and a novel first exon. This change resulted in the acquisition of oocyte-specific expression and is essential for fertility [56]. We speculate that the insertions of PxSEs into introns may provide signals for alternative splicing and polyadenylation, which may be a reflection of the host response to an ever-changing environment.
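
The comparison between the intronic share of the genome and the intronic share of SINE copies can be made explicit with a short calculation. The sketch below is only a rough illustration: it uses the copy counts reported in the Results and the 35.23% intron fraction cited above, and it assumes that the genome-wide intron fraction is a reasonable expectation for randomly placed insertions. It is not a statistical test, and the function and variable names are hypothetical.

```python
# Fraction of the DBM genome occupied by introns, as cited in the text.
INTRON_FRACTION = 0.3523

# (intronic copies, total copies) per family, as reported in the Results.
INTRONIC_COUNTS = {
    "PxSE1": (2750, 6208),
    "PxSE2": (2478, 5056),
    "PxSE3": (2470, 5158),
    "PxSE4": (2265, 4415),
    "PxSE5": (902, 1952),
}


def intron_enrichment(intronic, total, genome_fraction=INTRON_FRACTION):
    """Return the observed intronic fraction and its ratio to the
    genome-wide intron fraction (values above 1 suggest over-representation)."""
    if total <= 0:
        raise ValueError("total copy number must be positive")
    observed = intronic / total
    return observed, observed / genome_fraction


enrichment = {
    family: intron_enrichment(intronic, total)
    for family, (intronic, total) in INTRONIC_COUNTS.items()
}

for family, (observed, fold) in enrichment.items():
    print(f"{family}: {observed:.1%} intronic, {fold:.2f}x the genome-wide intron share")
```

Under these assumptions every family shows a ratio above 1, which is consistent with the preference for introns described above.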

Importantly, we also noticed that only 25 copies of SINEs were inserted into the genic exons of DBM, of which 13 copies were found in the 3′ UTR. In eukaryotic cells, some proteins (such as PUF proteins) can bind to regulatory elements in the 3′ UTR of mRNAs and control mRNA stability, translation and localization [57]. The genes with SINE insertions in exons are mainly annotated with functions in metabolism, cell division, signal transduction and transport, and it remains to be elucidated whether some of the SINE insertions influence gene expression.

HTT of SINEs

Increasing evidence showed that HTT is a common phenomenon. So far, no less than 5689 HTT events have been recorded [58]. However, only a few HTT events of SINE have been detected, including the SmaI-cor SINE between coregonid and common ancestor of salmonid (Hamada et al. 1997), Sauria SINE between reptiles and mammals [59], HaSE2 SINE between Aphis gossypii and Lepidoptera insects [33]. The long-term vertical inheritance property inherent in SINE and its dependence on active partner LINEs to move in new hosts may be the reason why HTT events rarely occur [47, 60], as was confirmed by the partial congruence between the phylogenetic trees of PxSE3 families and host species in this study. Interestingly, SlNPVSE1, a SINE copy inserted into the baculovirus, shared more than 90% identity to the consensus sequence of SfSE1, SlittSE1 and SlituSE1 (Additional file 4: Figure S4B). In addition, the absence of target site duplication as well as upstream host sequence in SlNPV-II, suggested that non-homologous end-joining of double-strand breaks might be the mechanism of HTT. SlNPV can successfully infect S. litura and S. exigua [61]. S. exigua multicapsid nucleopolyhedrovirus (SeMNPV) DNA can also replicate in five non-permissive cell lines including SF21AEII, CLS-79, SpLi-221, hi-5 and BmN4 [62], indicating a wider host range of NPV. Thus, our finding suggests the occurrence of HTT of PxSE1 between baculovirus and Lepidoptera insects. This is not surprising, because population genomics supported baculoviruses as vectors of horizontal transfer of insect transposons [63]. Similarly, the HTT of Helitron transposon Hel-2 and Tc1-like transposon TCp3.2 between insects and associated baculoviruses has been detected [64, 65]. Recent studies have revealed that the occurrence of HTT generally exhibits species ecological relationships, such as host-parasite [66, 67] and predator-prey [68, 69]. 
Additionally, proviruses have been reported as vectors for HTT of Sauria SINE from reptiles to mammals [59]. Hence, it is necessary to further explore the HTT events of PxSE1-like elements mediated by baculoviruses.

Conclusions

In this study, we identified three tRNA-derived SINEs and two 5S rRNA-derived SINEs in the genome of P. xylostella, among which PxSE1 is a relatively young retrotransposon generated by recent explosive amplification. Homology searches revealed a scattered distribution of these elements in other lepidopteran insects, with variable copy numbers. The preference of PxSINEs to insert or accumulate in introns of genic regions indicates that P. xylostella SINE families contribute to structural variation in introns. The identification of PxSE1-like elements in the baculovirus and related lepidopteran host insects provides evidence of horizontal transfer facilitated by host-parasite interactions. These data may have implications for understanding the evolution and HTT mechanisms of SINEs.

Methods

Data resources

The 235 publicly available insect WGS assemblies, including those of 33 lepidopteran insects, together with the EST, nucleotide (Nr/Nt), and TSA databases from the National Center for Biotechnology Information (NCBI) (last accessed November 30, 2018) were used in this study (Additional file 12: Table S2). The P. xylostella WGS was downloaded from NCBI [52]. The GFF file of GCF_000330985.1 was used as the corresponding gene annotation.

Database search strategy

To identify SINE candidates, database searches were performed in four steps. Firstly, known SINE sequences, including the tRNA-derived HaSE1 from Helicoverpa armigera [33] and BmSE from B. mori [28] and the 5S rRNA-derived HaSE3 from H. armigera [33], were used as queries for local BLASTN searches of the DBM genome. Sequences with high homology (at least 70% identity to the query over at least 50 bp), together with their 500 bp upstream and downstream flanking regions, were extracted using TBtools [70] and analyzed for conserved structural motifs of SINEs such as the internal RNA Pol III promoter and TSDs. The consensus sequences of PxSE1 and PxSE4 were determined by multiple sequence alignments. Secondly, the consensus sequences of PxSE1 and PxSE4 were searched against the DBM genome by local BLASTN to identify other potential homologous sequences, and two other tRNA-derived SINEs, PxSE2 and PxSE3, and a 5S rRNA-related SINE, PxSE5, were identified. Thirdly, the 50-bp fragment at the 3′-end of each SINE family was used as a query to search for potential partner LINEs, and the LINE related to PxSE5, PxLINE1, was identified. Finally, insect genome databases as well as the EST, Nr/Nt and TSA databases from NCBI were searched using the consensus sequences of these five SINE families as queries to detect SINEs in species other than DBM.
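
The hit-filtering part of the first step can be sketched as follows. This is a minimal illustration and not the authors' pipeline, which used local BLASTN and TBtools: it assumes the BLASTN results were saved in the standard 12-column tabular format (BLAST+ `-outfmt 6`), and the function names, contig names and example hits are hypothetical.

```python
def parse_hits(lines, min_identity=70.0, min_length=50):
    """Keep tabular BLASTN hits with at least min_identity percent identity
    over an alignment of at least min_length bp.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    Returns (contig, start, end, strand) tuples with start <= end.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        pident, length = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if pident >= min_identity and length >= min_length:
            # Minus-strand hits are reported with sstart > send.
            strand = "+" if sstart <= send else "-"
            hits.append((fields[1], min(sstart, send), max(sstart, send), strand))
    return hits


def with_flanks(hit, flank=500, contig_length=None):
    """Extend a hit by flank bp on both sides, clipped to the contig ends."""
    contig, start, end, strand = hit
    new_start = max(1, start - flank)
    new_end = end + flank
    if contig_length is not None:
        new_end = min(contig_length, new_end)
    return (contig, new_start, new_end, strand)


# Hypothetical tabular lines: two pass the thresholds, two are rejected.
example_lines = [
    "HaSE1\tctg1\t85.0\t120\t18\t0\t1\t120\t1000\t1119\t1e-20\t100",
    "HaSE1\tctg1\t65.0\t200\t70\t0\t1\t200\t5000\t5199\t1e-05\t60",
    "HaSE1\tctg2\t95.0\t40\t2\t0\t1\t40\t10\t49\t1e-03\t50",
    "BmSE\tctg2\t90.0\t101\t10\t0\t1\t101\t300\t200\t1e-15\t90",
]
kept = parse_hits(example_lines)
extended = [with_flanks(hit, flank=500, contig_length=700 if hit[0] == "ctg2" else None)
            for hit in kept]
```

The extended intervals can then be passed to any sequence extraction tool so that the flanks are inspected for promoter boxes and TSDs.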

Copy number estimation

To estimate the copy number and average divergence of SINEs, the respective consensus sequences were used to search against the related databases (Additional file 12: Table S2). All contiguous sequences with at least 80% nucleotide identity to the consensus over 100 bp were used to estimate copy number in all species [71, 72]. Given the high sequence identity of the 5′-ends in several copies of different SINE families in DBM, all indistinguishable copies were excluded. For example, PxSE1 and PxSE2 share high identity over a 120 bp sequence at their 5′-ends; thus, all copies that aligned only with part or all of this 120 bp region of the consensus sequence were excluded from the copy number analysis. Further, all fragments sharing at least 80% identity over at least 80% of the length of the consensus sequence were aligned and used to calculate the average divergence from the consensus sequence with the Kimura 2-parameter model [73]. The identity value of each copy to the consensus sequence was rounded to an integer for the relative age distribution analysis [53].
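
The Kimura 2-parameter divergence named above can be written out as a short sketch. With P the proportion of transitions and Q the proportion of transversions among compared sites, the distance is d = -(1/2) ln[(1 - 2P - Q) * sqrt(1 - 2Q)]. The code below is an illustration under stated assumptions, not the software used in the study: it assumes each copy has already been aligned to the consensus at equal length, skips gaps and ambiguous bases, and uses hypothetical function names and toy sequences.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


def average_divergence(consensus, aligned_copies):
    """Mean K2P distance of aligned copies to the consensus."""
    distances = [k2p_distance(consensus, copy) for copy in aligned_copies]
    return sum(distances) / len(distances)


# Toy example: one transition, one transversion, one gapped identical copy.
consensus = "ACGTACGTAC"
copies = ["GCGTACGTAC", "CCGTACGTAC", "ACGT-CGTAC"]
mean_divergence = average_divergence(consensus, copies)
```

Averaging per-copy distances to the consensus is one plausible reading of "average divergence"; other tools may compute it from pairwise distances among copies instead.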

Gene association and genomic show cases

The association of DBM SINEs with annotated genes was investigated using a custom Perl script from MapGene2Chrom (http://mg2c.iask.in/mg2c_v1.0) [74]. The integration of SINEs into genic regions, including introns, coding regions and untranslated regions, as well as the distances of intergenic copies to the closest neighboring gene, were determined as described previously [53]. The number of SINEs within each region was counted and the results were graphically represented using MapGene2Chrom.
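
The classification of insertions by genic region can be illustrated with a small sketch. This is not the Perl script used in the study: it assumes gene and exon coordinates have already been read from the annotation into simple Python structures for one contig, uses 1-based inclusive coordinates, and checks genes in list order, so an insertion spanning two genes is assigned to the first one. All names and coordinates are hypothetical.

```python
from collections import Counter


def classify_insertion(start, end, genes):
    """Classify an insertion as exon, intron or intergenic.

    genes: list of dicts with keys "name", "start", "end" and "exons"
    (a list of (start, end) tuples), all on the same contig.
    """
    for gene in genes:
        if end < gene["start"] or start > gene["end"]:
            continue  # no overlap with this gene
        for exon_start, exon_end in gene["exons"]:
            if start <= exon_end and end >= exon_start:
                return ("exon", gene["name"])
        return ("intron", gene["name"])
    return ("intergenic", None)


def tally_regions(insertions, genes):
    """Count insertions per region class."""
    return Counter(classify_insertion(s, e, genes)[0] for s, e in insertions)


# Hypothetical gene model and SINE insertions on one contig.
gene_models = [
    {"name": "geneA", "start": 100, "end": 1000,
     "exons": [(100, 200), (500, 600), (900, 1000)]},
]
sine_insertions = [(250, 400), (190, 300), (1200, 1400), (610, 880)]
region_counts = tally_regions(sine_insertions, gene_models)
```

Separating exonic hits into CDS, 5′ UTR and 3′ UTR would require the corresponding features from the annotation and is left out of this sketch.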

Sequence analysis and phylogeny

The tRNA-like structure of SINEs was checked with tRNAscan-SE [75], using the mixed model and a cove score cutoff value of 0.01 as default. Multiple SINE copies were aligned with MUSCLE [76], and the alignments were visualized with GENEDOC (www.psc.edu/biomed/genedoc). The phylogeny of the full consensus sequences of the SINE families was built with MEGA 7.0 using the Maximum Likelihood method with the K2 + G model [77]. The reliability of the trees was tested using 1000 bootstrap replications [71].