Background

Chromosomal mutations, such as polyploidization and chromosomal rearrangement, can lead to speciation, adaptation, and diversification [1,2,3,4,5]. Extant species are ancient polyploids from a common ancestor that experienced at least one whole-genome duplication (WGD) [6]. Eudicots core to their clade descended from an ancient whole-genome triplication event (γ) [7]. Chromosomal evolution influences the development of chromosomal size, structure, composition, and number of chromosomes [8]. Karyotype evolution will cause the chromosomal structure to be unstable, such as fusion and fission regions caused by rearrangement, as well as centromere regions that increase or disappear due to WGD or chromosome fusion [9]. Transposable elements may fill and stabilize these unstable regions in the chromosomes [10]. Therefore, reconstructing the ancestor karyotype and analysing the distribution of transposable elements are crucial for untangling the species local adaptation and speciation.

Previous approaches for ancestral karyotype reconstruction and projection defined contiguous ancestral regions based on collinearity among genomes. This method results in gaps in the projections and reveals unrefined karyotype details [11,12,13]. Based upon the assumption that ancestral chromosomes remain in contemporary genomes, a new method has been proposed to search shared intact chromosomes or chromosome-like syntenic blocks to construct a gap-less ancestor karyotype projection [14]. The newly constructed ancestral eudicot karyotype (AEK) and ancestral core eudicot karyotype (ACEK) would provide a better model for karyotype projections of modern species, and inform further research into the evolutionary history of Kingdom Plantae [15].

Along with polyploidization, amplification of transposable elements (TEs) is a primary form of mutation affecting the structure, function, and evolution of chromosomes [16,17,18,19,20]. Long terminal repeat retrotransposons (LTR-RTs) are major components of TEs in plant genomes, accounting for > 70% of the nuclear genomes of maize [21], tea [22], and rye [23]. However, their abundance and composition vary across species due to genome size and LTR-RT amplification [24, 25]. According to the positions of integrase (INT), LTR-RTs can be divided into Ty1/Copia and Ty3/Gypsy superfamilies and different lineages [26, 27]. The abundance of LTR-RTs specific-lineages is considered one of the important factors affecting species adaptation [28]. LTR-RTs spread throughout the genomes by retrotransposition (a copy/paste mechanism) during species evolution, which causes LTR-RTs amplification, genome expansion, and chromosome rearrangement [29, 30]. The LTR-RTs amplification contributes to chromosomal structure, centromere function, and regulation of gene expression [31,32,33]. Therefore, exploring the abundance, distribution, and evolutionary dynamics of LTR-RTs helps explain the molecular mechanism of chromosomal structural variation and evolutionary processes in genomes.

Quercus (oak), the largest genus in the family Fagaceae, is widely distributed in the Northern Hemisphere, including Asia, Europe, Africa, and the Americas [34]. As an important ecological and economic tree in East Asia, oaks are famous for their environmental adaptability, resistance to biotic and abiotic stresses, and providing many biological materials [35, 36]. Currently, eight chromosome-level oak genomes have been sequenced and annotated, including Q. acutissima [37], Q. dentata [38], Q. gilva [39], Q. glauca [40], Q. lobata [41], Q. mongolica [42], Q. robur [11], and Q. variabilis [43]. These provide a comprehensive database for analysing the genomic and chromosomal evolution of the genus. The evolutionary history and phylogenetic relationships of Quercus are well-established using high-quality nuclear and chloroplast genomes [44, 45]. The genus dates back to approximately 55 Ma (millions of years ago), and there have been no significant levels of chromosome fusion or species-specific WGD events [46,47,48,49].

Previous comparative genomics research on Quercus mainly concentrated on analysing interspecies genomic collinearity, phylogenetic relationships, and demographic dynamics. The karyotype evolution and LTR-RT diversity of oak species remain unknown so far. Understanding the karyotype evolution and LTR-RT distribution is important for a comprehensive and objective view of the oak evolution. Here, based on the high-quality oak genome sequencing, we aim to (I) reveal the chromosomal evolutionary history, (II) investigate intergeneric variation and evolutionary dynamics of LTR-RTs, (III) explore the influence of LTR-RTs insertion on gene regulation, chromosomal structure, and centromere functional. This study also provides a case for exploring species adaptation evolution and speciation from the perspective of karyotype and LTR-RT evolution.

Results

Chromosomal evolution of oaks

To infer the chromosomal evolution of Quercus (2n = 24), ancestral karyotype projections were reconstructed using the AEK as a reference. Synteny blocks and gene pairs between contemporary oak genomes and the AEK were, respectively, 306–505 and 5,929 − 15,865 (Table S2). Quercus chromosomes 1 and 7 have a conserved synteny relationship with AEK 6, and the other ten chromosomes exhibited a fusion of synteny blocks with fragmented ancestral chromosomes (Fig. 1 and S1). For example, Quercus chromosome 2 showed the synteny relationship with at least 4 AEK chromosomes and 11 fragments. Homologous gene dot-plots and the karyotype projections between Quercus and the ACEK were completed to explore the impact of ancient triplication events on the karyotype evolution (Figs. S2 and S3). Synteny blocks and gene pairs between oak species and the ACEK were 782-1,107 and 15,865 − 20,446, respectively (Table S2). ACEK chromosomes 3, 4, 6, 10, 13, and 15 were intact and preserved in the Quercus genomes. Other ACEK chromosomes were preserved as fragments in different Quercus chromosomes. Through chromosome arm exchange, for example, the ACEK 7 is preserved in the Quercus chromosomes 6 and 10. Intra-chromosomal rearrangements, such as the inversion of ACEK 5 on Quercus chromosome 11 and ACEK 7 on Quercus chromosome 6, indicated complex chromosome variation during oak evolution.

Fig. 1
figure 1

Karyotype (AEK) projections for eight oak species

To examine the ancestral chromosomal changes in oak species, we clarified the evolution of Q. glauca chromosomes. A total of six chromosomes were fused by two ACEK chromosomes. Quercus glauca chromosome 2 was fused by four ACEK chromosomes (ACEK 3, ACEK 11, ACEK 18, and ACEK 19; Fig. 2). Chromosome arm exchanges were observed in several chromosome pairs, such as 4 and 8, 6 and 10, and 9 and 12. After multiple chromosome fusions and arm exchanges, the chromosome number of Q. glauca remained stable. A total of 907-1,427 synteny blocks and 19,027 − 23,531 gene pairs were identified between Q. glauca and the other seven oak genomes (Table S2). Homologous gene dot-plots detected species-specific chromosomal rearrangements, such as chromosomes 1 and 7 of Q. gilva and chromosomes 4 and 11 of Q. dentata (Fig. S5). An inversion of approximately 5.6 Mb in chromosome 3 was unique to Q. glauca. Another inversion of chromosome 3 at ca. 7.6 Mb to 10.7 Mb and ca. 56.7 Mb to 82.4 Mb were unique in Q. lobata.

Fig. 2
figure 2

Evolution of modern chromosomes in Q. glauca. Arrows indicate ancestral pieces fused into one chromosome. Black boxes refer to the reference chromosomes to construct the modern karyotype of Q. glauca

Evolution of full-length LTR-RTs in Oaks

To explore the evolution of oak chromosomal structure, a total of 441,449 full-length LTR-RTs were identified in the eight genomes, including 22,579 Ty1/Copia (51.1%), 16,344 Ty3/Gypsy (37.0%), and 5,226 designated as Unknown (11.9%; Table S3). The densities (average number per Mb genome) of LTR-RTs in oak species varied from 4.6 (Q. dentata) to 8.6 (Q. glauca), and the cumulative length from 33.8 Mb (Q. dentata) to 56.8 Mb (Q. gilva; Table S3). The number of solo LTRs in the oak species varied from 83,118 (Q. robur) to 152,408 (Q. dentata), and the cumulative length from 93.2 Mb (Q. mongolica) to 136.7 Mb (Q. dentata; Table S3). The number of full-length LTR-RTs was variable between oak species, ranging from 4,102 (Q. dentata) to 7,455 (Q. glauca; Fig. 3a). The genomic content masked by LTR-RTs ranged from 3.8% (Q. dentata) to 7.1% (Q. glauca; Fig. 3b). In all oak species, Copia types were more abundant than Gypsy, and the average length of Gypsy types was larger than that of the Copia and Unknown (Fig. 3c).

Fig. 3
figure 3

Full-length LTR-RTs number, average length, and proportions across eight oak species. Blue represents the Copia; Green represents the Gypsy; Orange represents the Unknown. a Number of Copia, Gypsy, and Unknown were detected in eight oak species. b Genome proportion of Copia, Gypsy, and Unknown of each species. c The average length of the full-length LTR-RTs in eight oak species

The transposition time of LTR-RTs was estimated to be within the last 8 Ma (Fig. 4a). Four oak species (Q. gilva, Q. glauca, Q. mongolica, and Q. robur) showed more recent amplification within the last 0.2 Ma. Recent amplification of LTR-RTs in Q. dentata (about 0.8 Ma) was more ancient than the above four oak species. Differences in LTR-RTs amplification in oak species were mainly due to the difference in the insertion time of Copia (Fig. 4b, c and S6). Five species (Q. dentata, Q. gilva, Q. glauca, Q. mongolica, and Q. robur) showed dramatic Copia amplification. Two species (Q. acutissima and Q. variabilis) from section Cerris showed that the insertion time of Gypsy was more recent than that of Copia.

Fig. 4
figure 4

Evolution and diversity of LTR-RTs lineages in each oak species. a The estimated insertion time of all full-length LTR-RTs (MYA, millions of years ago); b The estimated insertion time of Copia; c The estimated insertion time of Gypsy; d Phylogenetic trees constructed based on reverse transcriptase domain sequences. The different colors on the outer circle represent different species, and the different colors on the Branch represent different lineages

According to their RT protein domains, the Copia and Gypsy types were subclassified into nine and seven lineages, respectively (Fig. S7). In Copia, SIRE, Ale, and Tork lineages were most common, and Retand and Ogre lineages were most common in Gypsy. The maximum likelihood (ML) tree indicated that much species-specific amplification occurred for several lineages in different species (Fig. 4d). The Copia/Ale lineages were amplified relatively ancient in Q. dentata (Fig. S8a). The Copia/SIRE lineages showed an activity burst in five oak species (Q. dentata, Q. gilva, Q. glauca, Q. mongolica, and Q. robur), and the burst of Q. dentata was more ancient than other species (Fig. S8b). The Copia/Angela lineages were only abundant in four oak species (Q. acutissima, Q. gilva, Q. glauca, and Q. variabilis), and a recent more active burst in Q. glauca was due to the amplification of Copia/Angela lineages (Fig. S8c). The Gypsy/Retand lineages showed variation in insertion time within all eight oak species (Fig. S8d).

Distribution of LTR-RTs in oaks

LTR-RTs are widely distributed in plant genomes through retrotransposition and may be inserted into the promoter or coding regions of genes. In oak species, we found more LTR-RTs inserted in promoter regions (293-1,772) than gene regions (302-1,495), except for in Q. lobata (Fig. S9 and Table S4). Inserted LTR-RTs suppressed gene expression, and the effect of inserts in the gene region was more significant than in promoter regions (Fig. 5a). This trend was consistent with LTR-RTs inserted in R-genes (Fig. 5b). Gene ontology (GO) analyses indicated that the LTR-RTs-associated genes showed various functions, such as metabolism, cell periphery, response to stress, gene regulation, and system development (Fig. S10). The Q. lobata was different from other oaks, mainly enriched in GO regulation of retrotransposon nucleocapsid (GO:0000943), transposition (GO:0032196), and DNA-directed DNA polymerase activity (GO:0003887). KEGG results showed that the LTR-RTs-associated genes were enriched in genetic information processing, transport and catabolism, signal transduction, translation, carbohydrate metabolism, and environmental adaptation (Fig. S11).

Fig. 5
figure 5

Impact of LTR-RTs on gene expression. a Gene expression levels of paralogous gene pairs with or without LTR-RTs insertion in three tissues of Q. glauca. p****<0.001. bR-gene expression levels of paralogous gene pairs with or without LTR-RTs insertion

To investigate the impact of LTR-RTs on chromosomal structure, we analyzed the distribution of genes, tandem repeats, LTR-RTs, and GC content in the Q. glauca genome (Fig. 6). The results showed that gene density, LTR-RT density, and GC content had regional special enrichment patterns. For example, the low gene density but high LTR-RT density and GC content were found in the 0-18.2 Mb region of chromosome 1, and chromosomes 4, 9, and 11 have multiple regions with similar characteristics. Chromosomal rearrangement regions in chromosomes 4, 8, and 11 have low gene density and high LTR-RT distribution.

Fig. 6
figure 6

Chromosomal distribution of genes, tandem repeat, LTR-RT density, and GC content. The red dots at both ends of the chromosome indicate the position of the telomere; The blue lines on the chromosome indicate the predicted position of the Centromere region

In most chromosomes of Q. glauca, there are regions with higher frequencies of LTR/Copia, LTR/Gypsy, tandem repeats, and GC content, but low-frequency gene density was consistent with the characteristics of the centromere region. Various methods were used to predict the potential centromere regions of Q. glauca. First, the enrichment of repeat units along the genome was detected (Fig. S13). A total of six centromere regions were found, and the repeat units varied among chromosomes. 32 bp repeat units were evident in the centromere regions of chromosomes 1, 5, 9, and 11, 79 bp in chromosome 2, and 78 bp in chromosome 4. Second, discontinuous signals in chromatin interaction heat maps were used to predict the potential centromere regions for each chromosome (Fig. S12a). Third, the analysis programs Centromics and the CentroMiner predicted six and twelve potential centromere regions, respectively (Table S5 and Fig. S12b, c). Based on the LTR-RTs distribution and the prediction methods, 12 chromosome centromeres were defined (Fig. 6). The six regions identified by repeat units were highly linked with the centromeres. Eight and six predicted regions correspond to the defined centromeres in the genomic discontinuous signals and CentroMiner results, respectively. In addition, we also used IGV to detect the repeat units in the miss-predicted centromere regions, which were caused by longer repeat units, such as the 367 bp long repeat units in chromosomes 1, 2, 8, and 10, as well as assembly gaps (Fig. S12d).

Discussion

The ancestor karyotype projection provides evidence for studying the evolutionary history of species by identifying collinear genes and their order [13, 15]. Previous ancestor karyotype projection studies contained undefined regions and only revealed limited karyotype dynamics [11, 12, 50]. This study utilized WGDI to identify the proto-chromosomes by searching for shared intact chromosomes or chromosome-like synteny blocks to complete gap regions [14, 15]. The ancestral karyotype projections of eight oak species from the four sections were established, elucidating the roles of chromosome fusion and arm exchange in the evolution of 12 modern chromosomes. This study could provide new insights into the impact of ancient whole-genome triplication events on karyotype evolution, the role of interspecies chromosome rearrangement in speciation, and the dynamics of oak chromosomal evolution.

As diploids, lineage-specific whole genome duplication events have not occurred in oaks [42]. By completing the gap regions in karyotype projections, the ancestral synteny blocks of all chromosome regions in oak species were defined, which contributes to exploring the differences in common ancestor and species-specific chromosomal evolution of oak species. The interspecific conserved synteny blocks exist between modern oak genomes and ancestral karyotypes from the same ancestor [51]. Previous research used the shared synteny blocks to explore the most intact chromosome as an ancestral proto-chromosome [12, 15]. However, complex rearrangement in oak genomes resulted in the distribution of shared synteny blocks within segments of several chromosomes, making it difficult to precisely explore the common ancestral proto-chromosome. Rearrangement occurs frequently in plant genomes and can promote the evolution of chromosome number, size, structure, and composition [8, 9]. After polyploidization, the following diploidization entails various chromosome rearrangements, such as inversions, translocations, fission and fusion, duplications, and deletions [52]. These events could contribute to the richness of structural diversity of the oak karyotype. Compared to the Betula pendula of Betulaceae, the evolution of AEK 6 in the Fagaceae and Betulaceae is relatively conservative, and their chromosomes have undergone complex rearrangements (Fig. S4) [15]. Chromosome rearrangement enriches the chromosomal structural diversity of these two widely distributed and ancient Fagales lineages and contributes to adaptive evolution. This study clarified the evolution of modern Q. glauca chromosomes and confirmed the important role of chromosome fusion and arm exchange in karyotype evolution. To elucidate the common ancestor and the specific details of karyotype evolution of oaks, it is necessary to analyze karyotype evolution based on representative genomes of other lineages [13, 15].

Identical and species-specific chromosomal rearrangements within oaks were shown in the ancestor karyotype projection and interspecific synteny relationships. In oaks, research has revealed the importance of natural hybridization and introgression in promoting genetic diversity and the generation of new species [53, 54]. Identical chromosomal rearrangements among oaks are associated with the evolution of Quercus’ common ancestor, and these rearrangements may have been preserved in frequent hybridization and have the effect of inhibiting recombination [55, 56]. Species-specific chromosomal variation enriched the lineage-specific diversity of chromosomal structure and contributed to the species reproductive isolation, speciation, and adaptive evolution [3, 9]. The accumulation of chromosomal rearrangements between species is largely incidental to speciation, and affects gene flow and fitness [55, 57]. For example, chromosomal rearrangements may cause postzygotic barriers or suppress the recombination of heterologous karyotypes, which could lead to speciation [58]. Some species-specific chromosome structural variation detected in this study were consistent with previous oak genome research [38,39,40, 49]. The species-specific inversion and translocation in chromosomes 3 and 5 of Q. lobata may be related to the ancient speciation and unique lineage evolution on the west coast of North America. The interspecific chromosome rearrangements appeared irregular among different sections, which could not provide direct evidence for divergence and speciation among oak species. Q. glauca and Q. gliva, from section Cyclobalanopsis, exhibited chromosome inversion in chromosomes 1 and 7, possibly related to speciation and habitat differences. Chromosome rearrangement undoubtedly enriches the diversity of oak karyotypes, and further research on rearrangement sequence should explore interspecific differences, stress resistance, and ecological adaptability in the oak species.

LTR-RTs and polyploidization promote adaptation and shape genomic structure [10]. The proportion of LTR in the oak species varied, ranging from approximately 139.1 Mb (17.2%) in Q. mongolica [42] to 371.3 Mb (46.6%) in Q. variabilis [43] (Table S1). Previous genomic studies on oaks focused on analyzing LTR-RTs content, with little further identification of intact full-length regions based on different lineages in the Copia and Gypsy subfamilies. According to conserved protein domains and the REXdb database [59], we identified intact full-length LTR-RTs from 33.8 Mb to 56.8 Mb when excluding some Unknown elements and solo LTRs. The amplification and depletion of LTR-RTs affect genome structure, size, and evolutionary rates [17]. Previous research on Fabaceae and Curcurbitaceae species has shown a significant positive correlation between LTR-RT content and genome size [24, 25]. Similar genome sizes but varying LTR-RTs densities in oaks imply that species-specific evolutionary histories could affected the richness of LTR-RTs across species. Several factors could contribute to the content of LTR-RTs, such as chromosomal rearrangement and solo LTRs content [60,61,62]. In oaks, Q. lobata, with species-specific chromosomal rearrangements, has fewer intact LTR-RTs and solo LTR, which may suggest that the genome maintained relatively stable after speciation. Two species with larger genome sizes, Q. glauca and Q. gilva, have more intact LTR-RTs and solo LTR, which may suggest rapid evolution in their genomes.

The LTR-RTs are sub-classified into different lineages in oaks, with SIRE and Retand accounting for most of the Copia and Gypsy subfamilies, respectively. Previous research [24, 25] found the scales and timeframes of activity amplifying LTR-RTs vary dramatically among families, lineages, and species [17]. In oaks, the Copia/Ale, Copia/SIRE, Copia/Angela, and Gypsy/Retand lineages exhibited varying amplification and evolutionary patterns. The amplification of different LTR-RTs lineages in the oak genome was a source of intraspecific polymorphism, which is considered an important factor affecting genomic diversity and adaptive evolution [63]. Although both Q. gilva and Q. glauca belong to the section Cyclobalanopsis, Q. gilva has more ancient amplification among the four lineages while Q. glauca shows recent independent amplification in Copia/Angela. Two species of section Cerris (Q. acutissima and Q. variabilis) showed more recent amplification in Gypsy. The different amplification/loss rates of LTR-RT specific lineages in oak species may imply a difference in the evolutionary rate of the sections and species [17].

Insertion of LTR-RTs into genomes impacts gene expression, regulation, and function, such as changing gene structure or the functional elements in the promoter region [25, 64,65,66]. Comparative transcriptomic analyses confirmed the suppression function of LTR-RTs inserted in Q. glauca genes, consistent with previous studies in Curcurbitaceae and Fabaceae species [24, 25]. In GO enrichment analysis, LTR-RT-associated genes in oaks were enriched in envelope and heterochromatin formation, which were related to SIRE and Retand amplification [67,68,69]. Meanwhile, the mutations caused by LTR-RT insertion may also affect phenotypes. For example, an LTR-RT inserted into the apple MdMYB1 gene will increase anthocyanidin accumulation and form red skin [70]. The LTR-RTs insertion in BoCYP704B1 is the primary cause of the male sterility in cabbage [71]. Therefore, the impact of inserted LTR-RT on gene expression regulation in oak genomes warrants further study.

Through integration and subsequent deletions, LTR-RTs are thought to facilitate subtle restructuring of chromosomal landscapes [9]. LTR/Copia and LTR/Gypsy were usually mixed with tandem repeats and enriched in plant centromere regions [60, 72, 73]. The pattern of 32, 78, and 79 bp repeat units are highly linked with the centromere regions of six chromosomes in Q. glauca, but Q. lobata has a consistent repeat unit (148 bp) for each centromere [49]. This result indicated that although the centromeres are conserved function across species, there is diversity in their structure and sequence [74]. The centromere region’s complex and highly repetitive structure often leads to collapse and truncation during genome assembly, which may mean we have not identified all centromeres [75]. During polyploidization and subsequent restoration to diploid, the centromere plays an important role in karyotype rearrangement and speciation [60, 76]. Some chromosomal rearrangement regions in Q. glauca exhibited unique patterns of LTR-RTs enrichment. The centromere tandem repeat units were also common in non-centromeres regions in the Q. glauca genome, which may be related to the centromere’s loss and formation after chromosome fusion and fission. However, whether ancient centromere repeats still exist in the modern genome and have special functions to maintain the stability of chromosomes remains a mystery [77]. Recent studies have proposed a new genome assembly method that can assemble a highly continuous and completely gap-free reference genome, allowing better identification of all centromere regions and exploring centromere evolution [78, 79]. This study can provide conditions for precise identification of the centromere regions in the oak genome to explore the variation between oaks and its impact on karyotype evolution.

Conclusions

This study revealed the effects of polyploidization and LTR-RTs amplification on oak genome structure, function, and evolution. We confirmed that after the ancient triplication event from AEK, the oak genomes decreased by nine chromosomes through fusion, fission, and rearrangement, reaching a stable state with 12 chromosomes in modern genomes. After speciation, recent LTR-RTs amplification in different lineages affected their composition and abundance variably in oak species. The insertion of LTR-RTs into genes partly suppresses gene expression. The distribution pattern of LTR-RTs combined with gene density, tandem repeat density, and GC content were used to identify centromere regions in the Q. glauca genome. However, in the long evolutionary history of oak species, clarifying the impact of ancestral karyotype evolution and LTR-RTs on genome amplification and chromosomal structural variation needs further verification using more high-quality genomes from related species.

Methods

Whole-genome duplication analyses and karyotype projection

Eight oak genomes were obtained from previous literature (Table S1), including two species, Q. gilva [39] and Q. glauca [40], from section Cyclobalanopsis; two species, Q. acutissima [37] and Q. variabilis [43], from section Cerris; four species, Q. dentata [38], Q. lobata [41], Q. mongolica [42], and Q. robur [11], from section Quercus. Chromosome sizes ranged from 750 Mb (Q. acutissima) to 893 Mb (Q. dentata) and gene numbers ranged from 30,820 (Q. acutissima) to 39,023 (Q. glauca; Table S1).

The projections of the ancestral eudicots karyotype (AEK) and ancestral core eudicots karyotype (ACEK) were reconstructed using WGDI v0.6.5 [14]. First, the protein sequences of the eight oak species were compared with the AEK and ACEK using BLAST v2.12.0 [80] with “-outfmt 6 -evalue 1e-5 -num_alignments 20” parameters. The script generate_conf.py (https://github.com/xuzhougeng/myscripts/blob/master/comparative/generate_conf.py) was used to obtain the gene location and chromosome information required by WGDI. Second, the “-icl” parameter in WGDI was used to identify collinear genes between the modern genomes and the two ancestral karyotypes, and “-bi -c -bk” parameters were used to integrate, filter, and check the synteny blocks. WGDI with the “-km” parameter was used to obtain the mapping results from AEK and ACEK to the oaks karyotype. Finally, homologous dot-plots between the modern genomes and the two ancestral karyotypes were plotted using WGDI, and the ancestral karyotype projections were visualized. The protein sequences of Q. glauca, the most complete genome among oak species so far, were compared with those of the other oak genomes using BLAST v2.12.0 [80] to identify the diversity in karyotype evolution and chromosomal rearrangement. Homologous dot-plots between Q. glauca and those of other oak species were plotted with the ACEK karyotype mapping results. CD-HIT [81] was used to remove redundant protein sequences with “-c 0.8 -aS 0.8 -d 0” parameters for further constructing phylogenetic trees. Then, OrthoFinder v2.5.4 [82] was used to identify orthologs and construct a maximum likelihood (ML) phylogenetic tree with the “-S diamond -M msa” parameters. We used “-M msa” for multiple sequence alignments (MSA) and used default parameters in MAFFT v7.515 [83] and FastTree v2.1.11 [84] to infer maximum likelihood trees.

LTR-RTs identification and annotation

We used EDTA v1.9.6 [85] (Extensive de-novo TE Annotator), a comprehensive process tool that integrates the results of several current LTR prediction tools, such as LTR_FINDER [86], LTRharvest [87], and LTR_retriever [88], to build a highly reliable non-redundant TE database, and annotated repeated sequences with RepeatMasker [89]. We used EDTA.pl with the “-species others -step all -anno 1 -sensitive 1” parameters to obtain the TE database for each oak genome. The protein domains of the elements belonging to different lineages of Copia or Gypsy superfamilies were analyzed using REXdb [27], which was implemented using TEsorter v1.2.5.2 [59]. The recombination caused by the disappearance of internal components will lead to the removal of intact LTR-RTs and the formation of solo LTRs [61, 62]. We extracted solo LTRs from the annotation file generated by the RepeatMasker in EDTA.

To explore LTR-RTs amplification and the disparity in evolution among oak species, we used the formula T = (1 - identity) / 2µ to calculate the transposition time of LTR-RTs, where identity represents the sequence similarity between 5’ and 3’ LTRs obtained from the EDTA analysis, µ represents the base substitution rate. The substitution rate 1.01 × 10− 8 of Q. lobate [49] is the oak substitution rate in this study. To investigate the historical dynamics of different lineages of Copia and Gypsy, we extracted RT protein domain sequences of diverse lineages in these superfamilies by the concatenate_domains.py script in TEsorter [59]. After sequence alignments were carried out using MAFFT v7.515 [83], ML phylogenetic trees were constructed and visualized using FastTree v2.1.11 [84] and iTOL [90], respectively.

LTR-RTs associated with genes

We analyzed the number and function of genes that overlap with LTR-RTs. The LTR-RTs overlapping with gene and promoter regions were calculated using the “intersect” function from BEDtools v2.30.0 [91]. Protein sequences of the gene and promoter regions overlapping with LTR-RTs were extracted. GO enrichment analysis of extracted genes was carried out using the eggNOG-mapper [92] online tool and the R package ClusterProfiler [93]. The metabolic pathways were annotated with KAAS [94] and visualized with R package ggplot2 [95].

We used transcriptome data from the leaf, inflorescence, and stem of Q. glauca from the NCBI SRA database (BioProject: PRJNA868092) to evaluate the impact of LTR-RTs on the expression of adjacent genes. Hisat2 v2.2.1 [96], Samtools v1.13 [97], and StringTie v2.2.1 [98] were used to compare transcriptome data to the reference genome, sort and index sam files, and obtain the read count. Gene expression level was quantified in TPM (transcripts per million). Paralogous genes were detected using BLAST v2.12.0 [80]. Expression levels of paralogous genes with and without overlapping LTR-RT were compared. We further analyzed the impact of LTR-RTs insertion on the expression level of resistance genes (R-genes), as the evolution of R-genes is widely considered to be affected by LTR-RT insertion.

LTR-RTs distribution

LTR/Copia and LTR/Gypsy were usually mixed with tandem repeats and enriched in plant centromere regions. Combined with previous research [79, 99], we used Q. glauca as a reference to scan the regions with a higher frequency of tandem repeat, LTR/Copia, and LTR/Gypsy distribution and also a higher GC content but low-frequency gene density. The densities of genes, tandem repeats, LTR/Copia, and LTR/Gypsy were calculated using BEDtools v2.30.0 [91] with parameters “-w 1000000 -s 200000” to make interval “windows” and “-counts -F 0.5” to compute the coverage. The GC content of the Q. glauca genome was calculated by seqkit [100] tools with the same sliding window size. The R scripts completed data visualization.

To predict potential centromere regions, we first used the Telomeres_and_Centromeres [99] method to detect the tandem repeats (TRs) by TRF v4.09.1 [101] software with the “2 7 7 80 10 50 500 -f -d -m” parameters, and TRF2GFF (https://github.com/Adamtaranto/TRF2GFF) was used to merge the annotated results. Then we screened high-frequency repeat units in each chromosome, using IGV v2.16.1 [102] to visualize the density of genome annotation, LTR-RTs, and repeat units. Potential centromere regions showed low-frequency peaks of genome and TE and high-frequency peaks of repeat units in IGV. Second, Juicebox v1.11.08 [103] was used to observe the Hi-C heat map of the Q. glauca [40] genome. Third, Centromics (https://github.com/zhangrengang/Centromics) and the CentroMiner tools of quarTeT v1.1.1 [78] were used default parameters to predict the potential centromere regions.