Introduction

Theaceae, a family in the order Ericales, is among the more species-rich angiosperm families, with nearly 370 accepted species, many of which are economically, ecologically, or horticulturally important [1]. The tribe Stewartieae (Stewartia and Hartia) was the earliest lineage to diverge within Theaceae, approximately 20.78 million years ago [1]. Stewartia gemmata (S. gemmata) is a member of the genus Stewartia and is found predominantly in the southern provinces of China, such as Hunan, Jiangxi, Fujian, Guangdong, and Yunnan, where it grows in mixed forests at altitudes of 900 to 1,500 m. The species typically reaches a height of four to eight meters and is characterized by smooth, grayish-yellow bark. S. gemmata is frequently grown as an ornamental because of its vibrant, decorative flowers. Its bark, roots, and fruits are used in traditional medicine and have substantial medicinal and economic value [2]. Because S. gemmata is an early-diverging species in Theaceae, exploring its organelle genomes provides an opportunity to unravel the reticulate evolutionary history caused by hybridization and to supplement the genetic resources of Theaceae. Investigating a representative of this early-diverging clade has significant implications for understanding species relationships within the tea plant family and for conserving its genetic resources.

Recent molecular phylogenies support the division of Theaceae into three tribes: Gordonieae, Stewartieae, and Theeae [3, 4]. Published genomes in Theaceae are concentrated in Camellia sinensis and oil-tea plants [5,6,7,8,9,10]. Because genomic data are lacking for other clades, such as the most divergent clade, the phylogeny and other evolutionary questions within Theaceae remain subject to debate. For example, a previous study based on the matK and rbcL gene markers recovered Gordonieae and Stewartieae as sister groups, but with weak support [3]. A study based on nrDNA ITS sequences indicated that Gordonieae is the sister group to Theeae [4]. These studies relied on one or a few loci with limited numbers of phylogenetically informative sites and therefore could not yield a high-resolution phylogeny of the tea plant family. A recent study based on extensive nuclear gene markers showed that Stewartieae and Gordonieae are successive sister groups of Theeae, with Stewartieae diverging first, followed by Gordonieae and Theeae [1]. Nevertheless, because high-quality genome data are absent for the Stewartieae and Gordonieae tribes, some evolutionary questions have remained difficult to answer.

In addition to the nuclear genome, plant cells carry independent genetic material in their organelles [11]. Like the nuclear genome, organelle genomes have multiple mechanisms for defense against genotoxicity, DNA repair, and maintenance of genome integrity, and they play an extremely important role in the normal life activities of plants [12]. Compared with animal organelles, the plant mitochondrial genome is much larger and more structurally variable, and the chloroplast genome is unique to plants [13]. Mitochondria play a crucial role in cellular energy conversion and supply. They are present in almost all cells, and their genome size is remarkably diverse in green plants, ranging from 20 kb in green algae to millions of bases in some angiosperm species [14, 15]. The mitochondrial genome contains many exogenous sequences from the nuclear and chloroplast genomes [16], as well as many repetitive sequences. These repeats may lead to frequent homologous recombination events [17], resulting in a range of genomic structures, mainly linear, circular, and complex multibranched or subcircular conformations [18, 19]. Chloroplasts play an indispensable role in photosynthesis and the carbon cycle and are responsible for the conversion of light energy into biomass [20, 21]. The chloroplast genome consists mainly of a large single-copy region (LSC), a small single-copy region (SSC), and two inverted repeat regions (IRs) [22], and its size ranges from approximately 107 kb in Cathaya argyrophylla to approximately 218 kb in Pelargonium [22].

Advances in sequencing technologies, along with the widespread use of third-generation sequencing platforms such as PacBio and Nanopore, have significantly improved the accuracy of genome assembly [23, 24]. This technological progress has made it feasible to resolve conflicts in phylogenomic studies using extensive gene markers from organellar or nuclear genomes [25]. Because organellar genomes are small and present in high copy number in plant cells, it is easy to obtain high sequencing depth at relatively low cost [22, 25]. In particular, the chloroplast genome, with its maternal inheritance, conserved genomic features, prevalence of single-copy genes, and small size, provides significant advantages for phylogenetic research [26]. For example, the evolutionary relationships of the early-diverging angiosperm clades have been reconstructed from 61 plastid genomes, a practical and efficient application of the technology [27]. Subsequently, the phylogenetic relationships of Viridiplantae were successfully reconstructed from 360 well-documented plastid genomes, providing another outstanding example [25]. As organellar phylogenomic research has progressed, studies centered on mitochondrial phylogeny have steadily expanded and diversified [28, 29]. Such studies are increasingly used to clarify species phylogeny, for example in Ganoderma (reishi mushroom) [30], Orchidaceae (the orchid family) [31], and Saposhnikovia divaricata [32].

S. gemmata, an ancient species within Theaceae, has substantial economic and medicinal value. To date, high-quality assembled and annotated organelle genomes have been lacking for representative species of the earliest-diverging lineage of the tea plant family. In this study, we assembled and annotated high-quality organelle genomes for S. gemmata. These assemblies enabled an exhaustive identification of codon usage bias in the protein-coding genes, repetitive sequences, and transposable elements (TEs), and allowed us to elucidate the structural characteristics of the Stewartia organelle genomes. We also conducted a comparative analysis of the organelle genomes and examined the phylogenetic relationships among species within Theaceae. The release of both the mitochondrial and chloroplast genomes of S. gemmata will not only enrich the existing organelle genomic resources but also facilitate comprehensive investigations of the Theaceae family. High-quality organelle genomes from a representative of the earliest-diverging lineage will contribute substantially to understanding the biology of the tea plant family. This valuable dataset will provide novel insights into the genomic characteristics, phylogenetic relationships, and evolutionary history of Theaceae species, laying an important foundation for further investigation.

Results

De novo assembly and annotation of S. gemmata organelle genomes

The high-quality organelle genomes of S. gemmata were assembled and annotated using a combination of next-generation and third-generation sequencing technologies (Fig. 1). The chloroplast genome has a typical single circular structure, whereas the mitochondrial genome exhibits a more complex conformation, comprising one principal circular structure and two plausible alternative circular molecules (Supplementary Fig. 1). The mitochondrial assembly contains three contigs, contig1 (452.78 kb), contig2 (224.68 kb), and contig3 (1.86 kb), of which contig3 is a repetitive sequence (Supplementary Fig. 1A). Together, these contigs represent the complete S. gemmata mitochondrial genome sequence. Because repetitive regions can produce multiple assembly conformations, we mapped Nanopore long reads to these regions to test for the existence of different mitochondrial genome conformations. Two conformations were supported by most of the long reads (Supplementary Table 1). The 'contig1-contig3-contig2-contig3' pathway constitutes a single circular molecule (Supplementary Fig. 1B). The 'contig1-contig3-contig1' and 'contig2-contig3-contig2' pathways represent two potential closed circular alternative molecules, possibly mediated by the repetitive sequence contig3 (Supplementary Fig. 1C). Given that both conformations contain all of the genomic information, we defined the single circular molecule as the primary ring (681,203 bp) for subsequent analysis (Supplementary Table 2).
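As a rough check on the two conformations, the path lengths can be tallied from the reported contig sizes. The sketch below is illustrative only: it uses the rounded kb values given in the text, the function name path_length_kb is our own, and each circular path is written as the list of contigs traversed once around the circle. Because of the rounding, the total differs from the exact 681,203 bp by a few tens of base pairs.

```python
# Contig lengths in kb as reported in the text (rounded to two decimals).
CONTIG_KB = {"contig1": 452.78, "contig2": 224.68, "contig3": 1.86}


def path_length_kb(path):
    """Sum the lengths of the contigs traversed once around a circular path."""
    return round(sum(CONTIG_KB[name] for name in path), 2)


# Single circular molecule: the repeat (contig3) is traversed twice.
master = path_length_kb(["contig1", "contig3", "contig2", "contig3"])

# Alternative conformation: two smaller circles, each with one repeat copy.
sub1 = path_length_kb(["contig1", "contig3"])
sub2 = path_length_kb(["contig2", "contig3"])

print(f"single circle: {master} kb")
print(f"two subcircles: {sub1} kb + {sub2} kb = {round(sub1 + sub2, 2)} kb")
```

Both conformations use contig1 and contig2 once and contig3 twice, so they carry the same sequence content, which is consistent with choosing the single circle as the reference for downstream analysis.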

Fig. 1

Circular maps of the organelle genomes of S. gemmata. A and B represent the mitochondrial genome and the chloroplast genome, respectively. Blocks of the same color indicate genes of the same functional group. Arrows indicate the direction of transcription: genes on the outside of the circle are transcribed counterclockwise, and genes on the inside are transcribed clockwise. In the inner circle bar graph, the dark color indicates GC content, and the light color indicates AT content. In subfigure B, the thick line in the inner circle indicates the two inverted repeats (IRa and IRb, 26,377 bp each), which divide the genome into a small single-copy (SSC, 18,136 bp) region and a large single-copy (LSC, 87,516 bp) region

The mitochondrial genome of S. gemmata was annotated with a total of 61 genes (Supplementary Table 3). These comprise 38 distinct protein-coding genes, of which 24 are core genes essential for mitochondrial function and 14 are noncore genes, together with 20 tRNA genes, two of which are present in multiple copies, and three rRNA genes, which are essential components of the ribosomal subunits. The core genes comprise five ATP synthase genes (atp1, atp4, atp6, atp8, and atp9); nine NADH dehydrogenase genes (nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7, and nad9); four cytochrome c biogenesis genes (ccmB, ccmC, ccmFC, and ccmFN); three cytochrome c oxidase genes (cox1, cox2, and cox3); a membrane transport protein-encoding gene (mttB); a maturase-encoding gene (matR); and a ubiquinol-cytochrome c reductase gene (cob). The noncore genes include four ribosomal large subunit genes (rpl2, rpl5, rpl10, rpl16), eight ribosomal small subunit genes (rps1, rps3, rps4, rps7, rps12, rps13, rps14, rps19), and two succinate dehydrogenase genes (sdh3, sdh4) (Fig. 1A).

The chloroplast genome of S. gemmata is a single circular molecule of 158,406 bp with a GC content of 37.26%. The annotation of the chloroplast genome led to the identification of a total of 114 genes (Fig. 1B and Supplementary Table 4), including 80 unique protein-coding genes (eight of which are multiple copies), 30 transfer RNA (tRNA) genes (seven of which exist in multiple copies), and four ribosomal RNA (rRNA) genes (all of which exist in multiple copies). The protein-coding genes belonged to 16 distinct gene families, including 11 NADH dehydrogenase subunit genes, five photosystem I subunit genes, 16 photosystem II subunit genes, six cytochrome b/f complex subunit genes, six ATP synthase subunit genes, and a single large subunit gene for ribulose-1,5-bisphosphate carboxylase/oxygenase. Additionally, there are four DNA-dependent RNA polymerase genes, nine large ribosomal subunit genes, 12 small ribosomal subunit genes, a single maturation enzyme-encoding gene, a cytochrome c biogenesis enzyme-encoding gene, a membrane protein-encoding gene, and a protease gene. There is also one gene for the subunit of acetyl-CoA carboxylase, another one for a translation initiation factor, and four for conserved open reading frames.
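The reported chloroplast figures can be cross-checked with simple arithmetic. The sketch below only re-adds numbers stated in the text and in the legend of Fig. 1 (region sizes, gene categories, and protein-coding gene families); the dictionary names are our own and carry no meaning beyond this check.

```python
# Quadripartite structure: region sizes in bp, from the legend of Fig. 1.
REGIONS_BP = {"LSC": 87_516, "SSC": 18_136, "IRa": 26_377, "IRb": 26_377}

# Gene categories reported for the chloroplast genome.
GENES = {"protein-coding": 80, "tRNA": 30, "rRNA": 4}

# Protein-coding gene families and their reported gene counts.
PCG_FAMILIES = {
    "NADH dehydrogenase": 11,
    "photosystem I": 5,
    "photosystem II": 16,
    "cytochrome b/f complex": 6,
    "ATP synthase": 6,
    "RuBisCO large subunit": 1,
    "RNA polymerase": 4,
    "ribosomal large subunit": 9,
    "ribosomal small subunit": 12,
    "maturase": 1,
    "cytochrome c biogenesis": 1,
    "membrane protein": 1,
    "protease": 1,
    "acetyl-CoA carboxylase subunit": 1,
    "translation initiation factor": 1,
    "conserved open reading frames": 4,
}

genome_bp = sum(REGIONS_BP.values())
total_genes = sum(GENES.values())
total_pcgs = sum(PCG_FAMILIES.values())

print(genome_bp, total_genes, total_pcgs, len(PCG_FAMILIES))
```

The four regions add up to the reported genome size of 158,406 bp, the three gene categories to 114 genes, and the 16 families to the 80 protein-coding genes.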

To assess the quality of the organelle genome assemblies, we mapped long reads and short reads to the assembled genomes and calculated the per-base coverage depth. The mitochondrial and chloroplast genomes reached sequencing coverage depths of 300X and 3,200X for short reads and long reads, respectively, with 100% genome-wide coverage (Supplementary Fig. 2).
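The two quantities reported here, mean depth and genome-wide coverage, can be derived from a per-base depth profile. The following is a minimal sketch under our own assumptions: the depth values are toy numbers, not data from this study, and the function name coverage_summary is hypothetical. In practice a per-base profile could be exported from a read alignment, for example with a tool such as samtools depth, although the text does not state which tool was used.

```python
def coverage_summary(depths, min_depth=1):
    """Return (mean depth, fraction of positions with depth >= min_depth)."""
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    mean_depth = sum(depths) / n
    breadth = sum(1 for d in depths if d >= min_depth) / n
    return mean_depth, breadth


# Toy per-base depths for a six-position "genome" (hypothetical values).
toy_depths = [290, 310, 300, 300, 305, 295]
mean_depth, breadth = coverage_summary(toy_depths)
print(f"mean depth: {mean_depth:.0f}X, covered: {breadth:.0%}")
```

A breadth of 1.0 corresponds to the 100% genome-wide coverage described in the text.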

Variation in organelle genome composition and mutation rate analysis of Theaceae species

We compared the organellar genomes of S. gemmata with those of six other Theaceae species and two species (Aegiceras corniculatum and Diospyros oleifera) from closely related families within the order Ericales. The mitochondrial gene composition of S. gemmata differs markedly from that of the other species in two main respects: first, only a small number of genes have duplicated copies (Supplementary Fig. 3); second, there is evidence of gene loss and gain, particularly in tRNA genes. The substantial variability in gene content among these species implies a complex history of gene loss and gain events throughout Theaceae mitochondrial genome evolution. In contrast, chloroplast gene composition was relatively conserved across the species examined (Supplementary Fig. 4). However, a pattern of tRNA gene gain and loss also appears in certain species, further demonstrating the dynamic nature of organellar genome evolution.

To further resolve differences in evolutionary rate between the organelle genomes of Theaceae, we analyzed the evolutionary rates of the protein-coding genes shared by the Theaceae species in the phylogenetic tree. The dN and dS values of chloroplast genes were greater than those of mitochondrial genes (Fig. 2A-B), which may reflect different evolutionary patterns. The dN/dS ratio of chloroplast genes, however, was approximately half that of mitochondrial genes, suggesting that the chloroplast genome has been subject to stronger negative (purifying) selection and is more conserved at the protein sequence level (Fig. 2C).
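The comparison rests on the ratio of nonsynonymous to synonymous substitution rates. The sketch below illustrates how absolute rates can be higher in one genome while its dN/dS ratio is lower. The dN and dS values are hypothetical and chosen only to mirror the pattern described; they are not the peak values from Fig. 2, and the function names are our own.

```python
def omega(dn, ds):
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    if ds <= 0:
        return None
    return dn / ds


def selection_regime(w):
    """Conventional reading of dN/dS for a protein-coding gene."""
    if w is None:
        return "undefined"
    if w < 1:
        return "purifying (negative) selection"
    if w > 1:
        return "positive selection"
    return "neutral"


# Hypothetical gene-pair values: chloroplast rates are higher in absolute
# terms, yet the chloroplast ratio is half the mitochondrial one.
cp = omega(dn=0.004, ds=0.040)
mt = omega(dn=0.002, ds=0.010)
print(cp, mt, selection_regime(cp), selection_regime(mt))
```

Both toy ratios fall below 1, so both genomes would be read as under purifying selection, with the lower chloroplast ratio indicating the stronger constraint.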

Fig. 2

Graphical representation of dN, dS, and dN/dS values of paired orthologous genes in the organelle genomes of seven species of Theaceae. A, B, and C show the density distribution of the values of dN and dS and the ratio of dN/dS, respectively. The numbers in the graph indicate the peak value. Red indicates the chloroplast genome (Cp), and blue indicates the mitochondrial genome (Mt)

Characteristics of codon usage bias of protein-coding genes in the S. gemmata organellar genome

To investigate the codon usage bias characteristics of the S. gemmata organelle genomes, we analyzed the relative synonymous codon usage (RSCU) values of all PCGs (Fig. 3A-B, Supplementary Table 5). A certain degree of variation was observed in the frequency of codon usage between the different organelles. The range of RSCU values in the chloroplast spanned from 0.32 (CGC in arginine) to 1.99 (UUA in leucine), whereas in the mitochondria, the range was from 0.46 (CAC in histidine) to 1.58 (GCU in alanine). Only AUG (start codon) and UGG had RSCU values of 1 in both organelles, and the rest were either greater than or less than 1, suggesting a general codon preference for PCGs. There are 31 codons with RSCU values greater than 1 in the mitochondrial and chloroplast genomes, of which only three end in G/C and the rest end in A/U, suggesting that there is a strong A/U tendency in the use of codons in the genomes of S. gemmata organelles.
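For readers unfamiliar with RSCU, the value for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below implements this usual definition with the standard genetic code; it is not the pipeline used in this study, the function name rscu is our own, and the input sequence is a toy example.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in UCAG order; "*" marks stop codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Compute RSCU per codon from an iterable of in-frame coding sequences."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("T", "U")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families by encoded amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # Observed count over the mean count of the synonymous family.
            values[c] = counts[c] * len(codons) / total
    return values


toy = rscu(["ATGTTATTATTG"])  # codons: AUG, UUA, UUA, UUG
print(toy["AUG"], toy["UUA"], toy["UUG"])
```

Amino acids encoded by a single codon always give a value of 1, which is why AUG (methionine) and UGG (tryptophan) show no preference in either organelle.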

Fig. 3

Summary of codon characteristics of protein-coding genes in the S. gemmata organelle genomes. A and B represent the relative synonymous codon usage (RSCU) distributions for the mitochondrial and chloroplast genomes, respectively. The horizontal axis shows the 21 amino acids, and the vertical axis shows the RSCU values. C and D show plots of ENC against GC3s for protein-coding genes in chloroplasts and mitochondria, respectively. The solid line is the expected curve of gene positions when codon usage is driven solely by GC3s composition. Purple dots indicate genes related to photosynthesis (C) and respiratory chain (D), red dots indicate transcription and translation-related genes, and green dots indicate other genes

Scatter plots of ENC against GC3s are frequently used to estimate the factors influencing codon usage patterns [33]. The two organellar genomes exhibit similar bias patterns, with a minority of genes positioned on or close to the standard curve and the majority deviating from it (Fig. 3C-D). Notably, the genes with pronounced deviations in chloroplasts predominantly pertain to photosynthesis (Fig. 3C), while those in mitochondria are primarily associated with respiration (Fig. 3D). Existing research suggests that codon bias in genes with an effective number of codons (ENC) below 35 is more susceptible to mutation, whereas codon bias in genes with an ENC above 35 is more likely to result from artificial or natural selection [34]. In the mitochondrial genome, atp9, trnI, and trnW have ENC values of 33.99, 28.5, and 22.45, respectively, all below 35. Of these, atp9 also had values below 35 in other reports [30]. In the chloroplast genome, genes with ENC values under 35 included psbJ (33.97), psbL (26.15), petN (30.7), trnR-ACG (32.95), trnF-GAA (23.83), and trnW-CCA (22.45) (Supplementary Table 6).
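The standard curve in an ENC-GC3s plot is commonly drawn with Wright's formula, ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC3s value. The sketch below assumes this formula underlies the curve in Fig. 3, which the text does not state explicitly. The relative deviation measure and the GC3s value used in the example are our own illustrative choices; only the observed ENC of 33.99 for atp9 comes from the text.

```python
def expected_enc(gc3s):
    """Expected ENC when codon usage depends only on GC3s (Wright's curve)."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must lie between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def relative_deviation(observed_enc, gc3s):
    """Fraction by which the observed ENC falls below the expected curve."""
    expected = expected_enc(gc3s)
    return (expected - observed_enc) / expected


# The curve peaks at GC3s = 0.5 and drops toward both extremes.
print(expected_enc(0.0), expected_enc(0.5), expected_enc(1.0))

# Observed ENC for atp9 from the text; the GC3s of 0.30 is hypothetical.
print(relative_deviation(33.99, 0.30))
```

A gene lying on the curve has a deviation near zero, and larger positive values place it further below the curve.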

Identification of putative RNA editing sites of PCGs in the mitochondrial genome

RNA editing plays an essential role in the regulation of genes in plant mitochondrial genomes [35]. To obtain more reliable results, we retained only potential RNA editing sites with probability values greater than 0.9. We identified 570 potential RNA editing sites in the 38 mitochondrial PCGs of S. gemmata (Fig. 4), more than previously identified in the Arabidopsis thaliana (441) and Brassica (427) mitochondrial genomes [36]. All putative editing sites involved a base change from C to U, consistent with earlier studies [35, 37]. The ccmB gene contained the most potential RNA editing sites (40), followed by mttB (35). Most RNA editing events alter the amino acid sequence, and some create initiation or termination codons, thereby modifying the gene structure. We identified seven such genes: atp6 and rpl16 acquired termination codons, while cox2, nad4L, nad5, nad7, and rps1 acquired initiation codons. The cox2 gene is affected by editing events at two positions (Supplementary Table 7).
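Because every predicted edit is a C-to-U change, only a few codons can become start or stop codons through a single edit. The sketch below enumerates them under the standard genetic code. It illustrates the mechanism only and is not the prediction method used in this study; the function name edits_creating is our own.

```python
from itertools import product

START_CODONS = {"AUG"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def edits_creating(targets):
    """List (codon, position, edited codon) for single C-to-U edits
    that turn a codon into one of the target codons."""
    hits = []
    for bases in product("ACGU", repeat=3):
        codon = "".join(bases)
        for pos, base in enumerate(codon):
            if base == "C":
                edited = codon[:pos] + "U" + codon[pos + 1:]
                if edited in targets:
                    hits.append((codon, pos + 1, edited))
    return hits


print("start codon created:", edits_creating(START_CODONS))
print("stop codon created:", edits_creating(STOP_CODONS))
```

Only ACG can be edited into the AUG start codon, and only CAA, CAG, and CGA can be edited into stop codons, which is why such events are rare relative to ordinary amino acid changes.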

Fig. 4

Identified traits of RNA editing sites in protein-coding genes of S. gemmata mitochondrial genomes. The X coordinate indicates PCG, and the Y coordinate indicates the number of RNA editing sites

Repeated sequence analysis of S. gemmata organelle genomes

Plant organellar genomes are replete with repetitive sequences, which fall into two main types, tandem and dispersed repeats, distinguished principally by their physical proximity [38]. We annotated repetitive sequences genome-wide to characterize the repeats of S. gemmata and their potential influence on genome structure (Fig. 5). Microsatellites, or simple sequence repeats (SSRs), are a particular class of tandem repeats whose repeat units are typically 1–6 bp long. We identified 229 and 66 SSRs in the mitochondrial and chloroplast genomes, respectively (Fig. 5). In the chloroplast genome, 42 mononucleotide SSRs (63% of the total) were detected, all of the A/T type, indicating a strong bias. The numbers of dimers, trimers, tetramers, and pentamers in the chloroplast genome were 7, 4, 12, and 1, respectively, and no hexamer SSRs were detected. In contrast, the distribution of SSRs in the mitochondrial genome was more balanced: tetramers were the most abundant (74, approximately 34%), while monomers and dimers accounted for 22% and 25%, respectively. Mononucleotide repeats consisted exclusively of A/T bases (Fig. 5D, Supplementary Table 8).
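SSR detection of this kind can be sketched with regular expressions. The example below is a simplified illustration rather than the tool used in this study: the minimum repeat numbers per unit length are assumed values (the thresholds actually applied are not given in this section), the function names are our own, and the test sequence is artificial.

```python
import re

# Assumed minimum repeat counts for unit lengths 1-6 bp (not from the text).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (start, motif, repeat count) tuples for perfect SSRs."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A captured unit followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "AA" or "ATAT", already reported under shorter units.
            if is_primitive(motif):
                found.append(
                    (match.start(), motif, len(match.group(0)) // unit_len)
                )
    return sorted(found)


toy_seq = "GGC" + "A" * 12 + "CGC" + "AT" * 6 + "GCC"
print(find_ssrs(toy_seq))
```

The primitive-motif check prevents a mononucleotide run from also being counted as a dimer or trimer. This simple scan can miss an SSR that directly abuts a run of another type, a limitation that dedicated SSR tools handle more carefully.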

Fig. 5

Repeat sequences within the mitochondrial and chloroplast genomes. The internal circle-colored lines correlate two dispersed repeats: blue for palindromic repeats, pink for forward repeats, yellow for reverse repeats, and green for complementary repeats. The second circle's black line denotes tandem repeats, while the outermost circle designates microsatellite repeats. A and B represent the physical distribution of repetitive sequences in the mitochondrial genome and the chloroplast genome, respectively. C and D represent the respective types of repeats and simple sequence repeats (SSRs)

In the S. gemmata mitochondrial genome, 31 tandem repeats were detected, ranging from 11 to 39 bp in length. In the chloroplast genome, 35 tandem repeats were identified, ranging from 9 to 33 bp in length (Fig. 5C and Supplementary Table 9). Dispersed repeats are also prevalent throughout the genomes and are considered to be of comparable functional importance [39]. We detected 1,026 and 42 pairs of dispersed repeats in the mitochondrial and chloroplast genomes, respectively (Fig. 5A-C and Supplementary Table 10). Of the four types of dispersed repeats (forward, reverse, complementary, and palindromic), complementary repeats were not found in the chloroplast genome, whereas the other types were present in both organellar genomes; most repeats were approximately 30 bp in length.
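The four dispersed repeat types differ in how the second copy relates to the first. The sketch below classifies a pair of equal-length sequences using the conventional definitions, in which a palindromic repeat is the reverse complement of the first copy. It handles exact matches only, whereas repeat-finding tools usually allow some mismatches, and the function name classify_repeat is our own.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first, second):
    """Classify an exact repeat pair by the orientation of the second copy."""
    first, second = first.upper(), second.upper()
    complement = first.translate(COMPLEMENT)
    if second == first:
        return "forward"
    if second == first[::-1]:
        return "reverse"
    if second == complement:
        return "complementary"
    if second == complement[::-1]:
        return "palindromic"
    return "not a repeat pair"


for copy in ("AAGC", "CGAA", "TTCG", "GCTT"):
    print(copy, classify_repeat("AAGC", copy))
```

For sequences that are their own reverse or complement, more than one label can apply, and the function returns the first match in the order tested.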

Detection of lateral gene transfer (LGT) between organelles and TE annotation

Homology analysis of the S. gemmata mitochondrial and chloroplast genomic sequences revealed 17 fragments with an identity greater than 0.8, totaling a length of 9,025 bp, constituting 1.32% of the total mitochondrial genome length (Fig. 6 and Supplementary Table 11). Three homologous fragments were over 1,000 bp, specifically MTPT1, MTPT15, and MTPT17, with MTPT1 being the longest (4,396 bp), corresponding to two positions of IRa and IRb in the plastid genome. These 17 homologous sequences contained 32 gene sequences from the plastid genome. Of these, 15 were complete, including nine protein-coding genes (psbJ, psbL, psbF, psbE, petL, petG, ndhJ, ndhK, rps7) and six tRNA genes (trnD-GUC, trnN-GUU, trnI-CAU, trnM-CAU, trnP-UGG, trnW-CCA).
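The reported proportion of plastid-derived sequence follows directly from the total fragment length and the mitochondrial genome size. The sketch below reproduces that arithmetic and shows an identity filter of the kind described. The genome size and total length come from the text, while the hit names, identities, and lengths in the toy list are hypothetical, as are the function names.

```python
MT_GENOME_BP = 681_203   # primary mitochondrial ring, from the text
MTPT_TOTAL_BP = 9_025    # combined length of the 17 fragments, from the text


def percent_of_genome(fragment_bp, genome_bp):
    """Percentage of the genome occupied by the given number of bases."""
    return 100.0 * fragment_bp / genome_bp


def filter_hits(hits, min_identity=0.8):
    """Keep (name, identity, length) hits with identity above the cutoff."""
    return [hit for hit in hits if hit[1] > min_identity]


mtpt_percent = round(percent_of_genome(MTPT_TOTAL_BP, MT_GENOME_BP), 2)
print(f"plastid-derived share of the mitochondrial genome: {mtpt_percent}%")

# Hypothetical homology hits: (name, identity, aligned length in bp).
toy_hits = [("hit_a", 0.95, 4000), ("hit_b", 0.78, 120), ("hit_c", 0.85, 300)]
kept = filter_hits(toy_hits)
print([name for name, _, _ in kept], sum(length for _, _, length in kept))
```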

Fig. 6

Comparative homology analysis of organelle genome sequence data. The mitochondrial and chloroplast genomes are represented by yellow and green arcs, respectively, while the blue lines connecting the arcs represent homologous genomic fragments

To validate the sequences of the MTPT regions in the mitochondrial genome, we selected nine MTPT regions with a length greater than 80 bp and mapped the mixed long reads from the mitochondrial and chloroplast genomes to corresponding areas. The mitochondrial reads were mapped to the MTPT and its flanking regions, whereas the chloroplast reads could only be mapped to the MTPT region, demonstrating that the sequence of the corresponding MTPT region had been integrated into the mitochondrial genome (Supplementary Fig. 5).

The numbers of TEs in the mitochondrial and chloroplast genomes were 133 and 58, respectively, with total lengths of 19,427 and 8,741 bp (Supplementary Tables 12 and 13). These were mainly of two types: DNA transposons (EnSpm/CACTA, Harbinger, Helitron, Mariner/Tc1 (specific to the mitochondrial genome), MuDR, hAT) and retrotransposons (Copia, Gypsy, L1). Mitochondrial genome TEs were predominantly retrotransposons (88 fragments, 16,158 bp), while those in the chloroplast genome were mainly DNA transposons (41 fragments, 6,292 bp). Finally, we observed a partial overlap between the sequence coordinates of the MTPT regions and some TEs in the organelle genomes (Supplementary Table 14).

The utilization of the organelle genomes of S. gemmata facilitated phylogenomic analyses of the tea plant family

To gain a more comprehensive understanding of the phylogenetic relationships among Theaceae species, we constructed maximum likelihood (ML) phylogenetic trees based on 36 mitochondrial genomes and 53 chloroplast genomes, including those of S. gemmata released in this study, with two species of Saxifragales as outgroups (Supplementary Table 15). The ML trees based on mitochondrial and chloroplast genome sequences strongly supported Santalales and Caryophyllales as successive sister groups to Ericales (Fig. 7A and Supplementary Fig. 6). The species of Theaceae are monophyletic, with S. gemmata as the early-diverging lineage of the family. In the chloroplast-based phylogeny, Theaceae separates distinctly into three tribes, with Stewartieae sister to a clade formed by Gordonieae and Theeae. Our analysis supports the monophyly of several genera within the family, specifically Camellia, Polyspora, Pyrenaria, Schima, Gordonia, and Stewartia. In Gordonieae, Gordonia and Franklinia are successive sister groups to Schima. Within Theeae, Apterosperma is sister to Pyrenaria, and Camellia and Polyspora are also sister groups. However, the limited availability of mitochondrial genome sequences from Theaceae constrains our understanding of the family's evolutionary dynamics based on mitochondrial gene markers and impedes a thorough exploration of the mitochondrial genome's influence on the family's phylogeny (Supplementary Fig. 6). In addition, we found subtle conflicts between the mitochondrial and chloroplast phylogenies within the genus Camellia.
Through a comprehensive analysis of the informative sites in the orthologous genes of the genus Camellia, we found a discrepancy in informative site rates between the chloroplast and mitochondrial genomes. On average, the informative site rate of the chloroplast genome was higher than that of the mitochondrial genome (Fig. 7B). Based on these findings, we hypothesize that the phylogenetic tree constructed from the chloroplast genome may provide a more accurate and representative depiction of the evolutionary relationships within the tea plant family.
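The informative site rate is described in the legend of Fig. 7 as the number of informative sites divided by the effective length of the gene. The sketch below uses the usual parsimony criterion, under which a column is informative when at least two character states each occur in at least two sequences. Treating the effective length as the number of columns free of gaps and ambiguous bases is our own assumption, as are the function name and the toy alignment.

```python
from collections import Counter


def informative_site_rate(alignment):
    """Parsimony-informative sites per effective (gap-free) column."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and equal in length")
    informative = 0
    effective = 0
    for column in zip(*(row.upper() for row in alignment)):
        # Assumption: columns with gaps or ambiguous bases are not counted.
        if any(base not in "ACGT" for base in column):
            continue
        effective += 1
        counts = Counter(column)
        # Informative: at least two states, each seen in two or more rows.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative / effective if effective else 0.0


toy_alignment = ["ACGTA", "ACGTA", "ATGTC", "ATG-C"]
rate = informative_site_rate(toy_alignment)
print(rate)
```

In the toy alignment, columns 2 and 5 are informative and column 4 is excluded because of the gap, giving two informative sites over four effective columns.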

Fig. 7

Best maximum likelihood phylogenetic tree based on chloroplast genomes of 53 species (including 25 species of Theaceae). A shows the phylogenetic tree constructed from the chloroplast genomes. Numbers above each branch are maximum likelihood bootstrap values for clades with support > 50%. The four background colors indicate different orders. The species sequenced in this study is shown in red font. The tree in the top left corner shows branch length information. Pentagrams of different colors at the nodes mark the Theaceae family and its three tribes. B shows the informative site rates of the organelle genes used to construct the phylogenetic relationships (number of informative sites/effective length of the gene). Red indicates the chloroplast genome (Cp), and blue indicates the mitochondrial genome (Mt)

Analysis of the mitochondrial genome collinearity among S. gemmata and other species of Theaceae

To better assess the conservation of mitochondrial genome evolution between S. gemmata and other species in Theaceae, we performed a collinearity analysis of the mitochondrial genome sequences. A total of 5,968 collinear regions were found among the seven Theaceae species (Fig. 8 and Supplementary Table 16), indicating widespread collinearity. The total length of the collinear blocks between OP270590.1 and OL989850.1, OL989850.1 and OM809792.1, MH376284.1 and MK574876-7.1, and MK574876-7.1 and NC067639.1 exceeded 1,000 kb in each case. We also identified a large number of inverted regions in the genomes of these species, suggesting that the structure of the mitochondrial genome is less conserved.

Fig. 8

Mitochondrial genome collinearity analysis of seven selected Theaceae species. Red arc-shaped regions mark inversions, and gray sections represent regions with a high degree of homology

Discussion

Structure, size, and repetitive sequences in the organelle genomes of S. gemmata

A prior investigation revealed that while chloroplast genomes exhibit conserved size and gene structure across cucurbit species, mitochondrial genomes display considerable heterogeneity in size, gene content, organizational structure, and noncoding DNA characteristics [40]. This is the first study to report the complete organelle genomes of S. gemmata. Based on the long-read coverage of the repeat junctions, we inferred two mitochondrial genome conformations: a single closed circle and an alternative pair of smaller circles mediated by the repeat. The high percentage of long reads supporting the alternative conformation implies that the mitochondrial genome of S. gemmata undergoes frequent recombination. We also found low conservation in gene composition, especially in tRNA genes. The fluctuating trend in total copy number, first increasing and then decreasing, suggests historical expansion events followed by a current contraction phase in the Theaceae mitochondrial genomes.

Typically, the structure of land plant chloroplast genomes exhibits a quadripartite circular conformation composed of a pair of IR sequences and LSC and SSC regions with conservative structural features [22]. We found the chloroplast genomic structures along with gene type and number to be fairly comparable in Theaceae [41]. In line with our observations, we propose that this highly preserved characteristic could potentially be an outcome of evolutionary mechanisms subjected to purifying selection. Such selection dynamics appear to favor the functional conservation of chloroplasts, specifically those involved in photosynthetic processes. This suggests a significant evolutionary advantage in maintaining the efficiency and integrity of photosynthesis-related functions within the chloroplast.

Repetitive sequences, including tandem repeats, dispersed repeats, and SSRs, are widespread in organelle genomes, play a vital role in genome structure and size, and can mediate genome recombination [17, 23, 39, 42, 43]. We likewise detected abundant repetitive sequences of multiple types in the S. gemmata organelle genomes. The number of dispersed repeats in the chloroplast genome is far lower than in the mitochondrial genome, which may explain the more complex structure of the mitochondrial genome compared with the chloroplast genome. SSRs are often used to develop molecular markers because of their exceptional polymorphism across species. They are widely used in species identification, facilitating the conservation and utilization of germplasm resources [44]. In this study, we identified a substantial number of SSRs, providing a resource for germplasm identification, conservation, and utilization.

Codon usage pattern analysis among mitochondrial and chloroplast genes of S. gemmata

As an important evolutionary phenomenon, codon usage bias has been documented in a wide variety of prokaryotic and eukaryotic organisms [45]. Such biases are attributed to an intricate interplay of mutation pressure, natural selection, and genetic drift over long periods of evolutionary time [46, 47]. Extensive studies of codon bias in plants have revealed that codon usage in the nuclear genome is significantly influenced by nucleotide composition, whereas in organelle genomes natural selection has a more pronounced effect on codon bias [33, 48]. We found a strong preference for A/U-ending codons in the organelle genomes of S. gemmata, which is consistent with the codon usage bias in the mitochondrial genomes of the other four species in Theaceae [30]. Similar findings were also reported in previous studies on the chloroplast genomes of several plants, including monocots such as wheat and rice [49, 50] and eudicots such as cucumber and Elaeagnus [33, 51], which suggests that A/U-ending codon bias might be an inherent characteristic of organelle genomes.

The ENC-GC3 analysis (effective number of codons versus GC content at the third codon position) is a robust technique for investigating the factors that may shape species-specific codon usage patterns [33]. Genes that lie on or near the standard curve in the ENC-GC3 plot are presumed to have codon bias shaped mainly by mutation pressure, whereas genes that lie farther from the standard curve are more likely influenced by natural selection [34]. This relationship allows the evolutionary constraints and mutational pressures acting on these genes to be explored. Our research revealed that most photosynthesis-related genes in the chloroplast genome, as well as the majority of respiration-related genes in the mitochondrial genome, deviate from the standard curve. This deviation is consistent with the trends seen in other species within the tea plant family [30] and indicates that these genes in Theaceae species are under intensified natural selection pressure, tending toward more conservative evolution.
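As an illustration of the standard curve referred to above, the following minimal sketch computes the expected ENC for a given GC3s value using Wright's formula, ENC = 2 + s + 29 / (s^2 + (1 - s)^2), and reports how far an observed ENC falls below it. The function names and the example gene values are hypothetical and are not taken from this study; CodonW output would normally supply the observed values.

```python
def expected_enc(gc3s):
    """Expected ENC when codon usage is shaped by GC3s alone (Wright's standard curve)."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def enc_deviation(observed_enc, gc3s):
    """Relative distance below the curve: (expected - observed) / expected."""
    expected = expected_enc(gc3s)
    return (expected - observed_enc) / expected


# Hypothetical (gene, observed ENC, GC3s) values, for illustration only.
genes = [("geneA", 45.0, 0.30), ("geneB", 52.5, 0.28)]
for name, enc, gc3s in genes:
    print(name, round(expected_enc(gc3s), 2), round(enc_deviation(enc, gc3s), 3))
```

Genes with a large positive deviation sit well below the curve, which is the pattern the discussion interprets as a signature of selection rather than mutation pressure alone.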

TEs and lateral gene transfer (LGT) in mitochondrial and chloroplast genomes

Transposable elements (TEs) in mitochondria are common in intergenic regions. For instance, TEs are abundantly distributed in the mitochondrial genome of Arabidopsis thaliana, suggesting a transfer from nuclear genomes to mitochondrial genomes [52]. In addition, there is evidence of the transfer of nonfunctional DNA sequences from the plastid genome to the mitochondrial genome [23]. In our study, the TEs in the mitochondrial genome were dominated by retrotransposons [53], while those in the chloroplast genome were dominated by DNA transposons. We found 17 mitochondrial plastid DNAs (MTPTs) in the mitochondrial genome, comprising 15 complete chloroplast genes. We also observed an overlap between TEs and some MTPTs, especially the DNA-11N_SBi element, which is shared by the mitochondrial and chloroplast genomes. This suggests that TEs might also be involved in sequence migration between organelle genomes. LGT could play a vital role in elucidating the evolutionary relationship between mitochondrial and chloroplast genome structures. Prior research identified four plastid-derived tRNA genes (trnD-GUC, trnH-GUG, trnN-GUU, and trnW-CCA) in the mitochondrial genomes of Asterids and Ligustrum quihoui [53]. Our study detected that trnD-GUC, trnN-GUU, and trnW-CCA originated from the chloroplast genome, which implies that the mitochondrial genome acquired these genes during an ancient LGT event. The longest MTPT, MTPT1 (4,396 bp), had 100% similarity, indicating a likely recent acquisition from the chloroplast.

Phylogenetic relationships in Theaceae

A previous study has shown that the tribes Stewartieae and Gordonieae are successive sister groups to the tribe of Theeae [1]. The tribe Stewartieae, as the earliest diverging tribe within the Theaceae family, occupies a pivotal position in phylogenetic research for understanding the evolutionary history of the tea plant family. Unfortunately, there is a lack of a comprehensive and high-quality assembly and annotation of both organelle and nuclear genomes for the Stewartieae and Gordonieae tribes. This absence has created gaps, affecting not only phylogenetic analyses but also various other aspects of genetic and evolutionary studies within the tea plant family. In our study, we released the organelle genomes of S. gemmata and reconstructed the phylogeny based on the conserved PCGs of the mitochondrial and chloroplast genomes. The highly supported phylogeny clearly showed that all species were divided into three clades, Santalales, Caryophyllales, and Ericales. The phylogenetic trees from both the mitochondrial and chloroplast genomes suggested that Stewartia was the first-diverged clade of Theaceae. The chloroplast-derived phylogenetic relationships within Theaceae are largely congruent with those established in previous studies [54]. The current mitochondrial genome dataset encompasses only species from the Camellia genus and the species S. gemmata released in this study from the Stewartia genus. A limited number of genomes impede our understanding of intergeneric relationships within Theaceae. The acquisition and analysis of additional mitochondrial and nuclear genomes are anticipated to elucidate the phylogenetic relationships of Theaceae species more comprehensively in the future. Moreover, in the phylogenetic trees derived from mitochondrial genomes (ML), regardless of the selection of different gene sets, the tea plants (Camellia sinensis) are not a monophyletic group. 
By analyzing the phylogenetically informative sites of mitochondrial and chloroplast orthologous genes, we found that the uncertainty in the relationships within Theeae, as depicted in the mitochondrial maximum likelihood (ML) trees, is primarily attributable to the scarcity of informative sites within mitochondrial genes (Fig. 7B and Supplementary Fig. 6). Moreover, an analysis of organelle sequence transfers revealed that the mitochondrial genome of S. gemmata contains exogenous sequences from the chloroplast genome. Differences in the mutation rates of sequences from different genomes, together with frequent mitochondrial recombination, might affect the resolution of the phylogeny. To date, fewer than ten Theaceae mitochondrial genomes have been sequenced, and the limited number of genomes might lead to data bias, thus limiting the phylogenetic resolution to some extent. Overall, our study provides the first mitochondrial genome of the genus Stewartia, which contributes to the understanding of phylogenetic relationships within Theaceae.

Conclusion

In this study, we present the first comprehensive assembly and annotation of the complete mitochondrial and chloroplast genomes of S. gemmata, utilizing both Illumina short-read and Nanopore long-read sequencing technologies. This organelle genome dataset, as a representative of the earliest diverging clade of the tea plant family, lays the foundation for further exploration of multiple biological facets of Theaceae, including phylogeny, evolutionary history, lateral gene transfer (LGT), and complexities of species hybridization. The study characterized different aspects of the organelle genomes, including the genome size, composition, repeated sequences, and LGT events between the organelle genomes. A large number of repetitive sequences were identified, elucidating the characteristics of the two types of organelle genomes. Additionally, the identification of simple sequence repeats (SSRs) provided data to support the development of molecular phylogenetic markers. The reconstruction of the phylogenetic relationships within the Theaceae family, based on conserved organelle genes, has afforded valuable insights into its phylogeny. This study has elucidated not only the specific characteristics of the organelle genomes of S. gemmata but also the broader evolutionary history of the tea plant family. The complete decoding of the organelle genomes serves as a pivotal resource, paving the way for multifaceted analyses in various biological studies.

Materials and methods

Genome assembly and annotation

Fresh S. gemmata leaves were harvested from Xinyang City, Henan Province (114.08°E, 32.13°N). Immediately after collection, they were rapidly frozen in liquid nitrogen and stored in a -80℃ environment, followed by DNA extraction. The libraries were constructed using the Nextera DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA) and sequenced on the Illumina NovaSeq 6000 platform. The long fragment library was constructed using the SQK-LSK109 kit, quality-checked using Qubit, and then sequenced using the PromethION P48 platform.

Mitochondrial genome assembly and annotation:

  (1) The long-read data were assembled using the default parameters of Flye (v2.9) [55] to obtain graphical assembly results in GFA format. A BLAST database of all contig sequences in the GFA-format assembly was built using makeblastdb, a module of BLAST (v2.13.0) [56], and contigs containing mitochondrial sequences were identified using BLASTn (v2.13.0) [57], with conserved Arabidopsis mitochondrial genes as query sequences and the parameters “-evalue 1e-5 -outfmt 6 -max_hsps 10 -word_size 7 -task blastn-short”. The GFA files were visualized using Bandage (v0.8.1) [58]. Finally, the long reads were mapped onto the graphical mitochondrial genome fragments using BWA (v0.7.17) [59] with default parameters, and the mitochondrial long reads were exported for subsequent resolution of the repetitive sequence regions of the graphical plant mitochondrial genome.

  (2) Using the contigs obtained in step (1), we mapped short-reads to mitochondrial contigs using BWA (v0.7.17) with default parameters [59] and exported the mitochondrial short-reads. Based on the obtained mitochondrial short-reads and long-reads, we used Unicycler (Pacific Biosciences, Menlo Park, CA, USA) [60] for hybrid assembly to obtain the final mitochondrial genome sequence.

  (3) Mitochondrial genes of Stewartia gemmata were annotated using GeSeq v2.03 (https://chlorobox.mpimp-golm.mpg.de/geseq.html) [61] by homologous comparison with the mitochondrial genomes of four reference species. The reference mitochondrial genomes included Arabidopsis thaliana (NC_037304), known for its complete and well-curated mitochondrial genes, Liriodendron tulipifera (NC_021152.1), recognized for its comprehensive mitochondrial variable genes, and Camellia nitidissima (NC_067639.1) and Camellia sinensis (NC_043914.1), both closely related to Stewartia gemmata within the Theaceae family, to ensure precise and accurate annotation of mitochondrial genes. tRNAs were annotated using tRNAscan-SE (v2.0.11) [62], and rRNAs were annotated using BLASTn (v2.13.0) [57]. Each gene was manually corrected using Apollo (v1.11.8) [63].

Chloroplast genome assembly and annotation:

  • The sequencing data were assembled using GetOrganelle (v1.7.5) [64] with the parameters “-R 15 -k 21,45,65,85,105 -F embplant_pt”, resulting in a circular S. gemmata chloroplast genome. The chloroplast genome was annotated using CPGAVAS2 (http://47.96.249.172:16019/analyzer/home, last accessed in April 2023) [65]. tRNAs were annotated using tRNAscan-SE (v2.0.11) [62], and rRNAs were annotated using BLASTn (v2.13.0) [57]. Annotation errors in each gene were manually corrected using CPGView (http://www.1kmpg.cn/pmgview) [66] and Apollo (v1.11.8) [63]. OGDRAW (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) [67] was used to draw the maps of the mitochondrial and chloroplast genomes. The genome sequences are deposited in GenBank under the accession numbers OQ948158 (mitochondrial genome) and OQ948159 (chloroplast genome).

Estimation of nucleotide substitution rates between homologous gene pairs

We used PhyloSuite (v1.1.16) [68] to extract orthologous genes from the collected mitochondrial and chloroplast genome datasets and performed multiple sequence alignments using MAFFT (v7.505, parameter “--auto”) [69]. The nucleotide substitution rates of paired orthologous genes, including the nonsynonymous substitution rate (dN), the synonymous substitution rate (dS), and the dN/dS ratio, were estimated using the yn00 module in pamlX (v1.3.1) [70]. The dN, dS, and dN/dS values of each orthologous pair were visualized as density plots generated with the ggplot2 [71] package in R.

Codon usage bias analysis

Protein-coding sequences of the genome were extracted using PhyloSuite (v1.1.16) [68]. Protein-coding genes of the mitochondrial genome were analyzed for codon preference using MEGA (v7.0) [72], and relative synonymous codon usage (RSCU) values were calculated. The effective number of codons (ENC) and the GC content at the third position of synonymous codons (GC3s) were calculated using CodonW (v1.4.4) [73]. All data were visualized using the R package ggplot2 [71].
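For readers unfamiliar with RSCU, the sketch below shows the calculation in plain Python: the observed count of each codon is divided by the mean count of all synonymous codons for the same amino acid. It is not the MEGA implementation; it assumes the standard genetic code, skips stop codons and codons with ambiguous bases, and the function name and example sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} for all sense codons of amino acids seen in the input."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons containing N or other symbols
                counts[codon] += 1

    # Group sense codons into synonymous families.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


# Hypothetical toy sequence: four alanine codons (GCT x3, GCC x1).
result = rscu(["GCTGCTGCTGCC"])
print(result["GCT"], result["GCC"], result["GCA"], result["GCG"])
```

An RSCU above 1 marks a codon used more often than expected under equal usage of synonymous codons, which is how the A/U-ending preference discussed earlier would show up.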

RNA editing site identification

We used the sequences of all protein-coding genes (PCGs) encoded by the mitochondrial genome of this species as input files and used Deepred-mt [74] to predict the C to U RNA editing sites of mitochondrial PCGs. We then retained all results with probability values greater than 0.9.

Repeat sequence analysis

Repetitive sequences, including microsatellites (SSRs), tandem repeats, and dispersed repeats, were identified using MISA (v2.1) (https://webblast.ipk-gatersleben.de/misa/) [75], TRF (v4.09) (https://tandem.bu.edu/trf/trf.unix.help.html) [76], and the REPuter web server (https://bibiserv.cebitec.uni-bielefeld.de/reputer/) [77], respectively. The results were visualized using Excel (2021) and Circos (v0.69.9) [78].
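To make the idea of SSR detection concrete, the following is a simplified regular-expression sketch, not the MISA algorithm. It reports perfect repeats of 1 to 6 bp motifs that reach a minimum repeat count; the thresholds in MIN_REPEATS are hypothetical placeholders, since the settings used in this study are not stated here.

```python
import re

# Hypothetical minimum repeat counts per motif length (bp); not taken from the paper.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeat_count) tuples with 1-based inclusive coordinates."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # One k-mer followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start() + 1, m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


# Toy sequence with one mononucleotide and one dinucleotide SSR.
toy = "GG" + "A" * 12 + "CC" + "AT" * 6 + "GG"
print(find_ssrs(toy))
```

The primitive-motif check prevents a run such as A x12 from being reported a second time as an "AA" or "AAA" repeat. Compound and imperfect SSRs, which dedicated tools handle, are outside the scope of this sketch.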

Analysis of lateral gene transfer and TE identification

The positions of LGT sequences were obtained by comparing the chloroplast genome with the mitochondrial genome using BLASTn (v2.13.0) with default parameters [57], and the results were visualized using Circos (v0.69.9) [78]. CENSOR (https://www.girinst.org/censor/index.php) [79] was used to identify TEs in the S. gemmata organelle genomes, with default parameters and "green plants" as the reference.
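Once MTPT and TE coordinates are available, checking which of them overlap is a simple interval comparison. The sketch below is one possible way to do this, not the procedure used in the study; the record names and coordinates are hypothetical, and intervals are assumed to be 1-based and inclusive on the same genome.

```python
def overlap_length(a, b):
    """Number of bases shared by two 1-based inclusive intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def overlapping_pairs(mtpts, tes, min_overlap=1):
    """Return (mtpt_name, te_name, shared_bp) for every pair sharing at least min_overlap bases."""
    pairs = []
    for m_name, m_start, m_end in mtpts:
        for t_name, t_start, t_end in tes:
            shared = overlap_length((m_start, m_end), (t_start, t_end))
            if shared >= min_overlap:
                pairs.append((m_name, t_name, shared))
    return pairs


# Hypothetical coordinates, for illustration only.
mtpts = [("MTPT_a", 100, 500), ("MTPT_b", 1000, 1200)]
tes = [("TE_x", 450, 600), ("TE_y", 2000, 2100)]
print(overlapping_pairs(mtpts, tes))
```

The nested loop is adequate for the small number of features in an organelle genome; a sorted sweep or interval tree would only matter at much larger scales.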

Phylogenetic analysis

Orthologous genes were extracted using PhyloSuite (v1.1.16) [68], and multiple sequence alignment was performed using MAFFT (v7.505, parameter “--auto”). The aligned sequences were concatenated into a supermatrix using PhyloSuite, and ML phylogenetic analysis was performed using IQ-TREE (v1.6.12) with the parameters “-m GTR+G -bb 1000”. The phylogenetic tree was visualized using iTOL (v6) [80]. All genomic data were obtained from the NCBI GenBank database, and details are recorded in Supplementary Table 15.
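The concatenation step can be pictured with the short sketch below, which joins per-gene alignments into a supermatrix and pads taxa missing from a gene with gaps. It is an illustration only and does not reproduce PhyloSuite's implementation; the function name, the gap-padding choice, and the toy alignments are assumptions.

```python
def concatenate_alignments(alignments):
    """Concatenate per-gene alignments into a supermatrix.

    alignments: list of dicts mapping taxon name -> aligned sequence, where all
    sequences within one gene have the same length.
    Returns (supermatrix, partitions): supermatrix maps taxon -> concatenated
    sequence; partitions lists 1-based inclusive (start, end) columns per gene.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for index, aln in enumerate(alignments, start=1):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {index}: sequences are missing or differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are filled with gaps of the same width.
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((position + 1, position + length))
        position += length
    return {taxon: "".join(parts) for taxon, parts in pieces.items()}, partitions


# Toy alignments with hypothetical taxon names.
gene1 = {"sp1": "ATG-", "sp2": "ATGC"}
gene2 = {"sp1": "GG", "sp3": "GA"}
matrix, parts = concatenate_alignments([gene1, gene2])
print(matrix, parts)
```

The returned partition boundaries are useful if per-gene models are ever needed, although the analysis described here applied a single GTR+G model to the whole matrix.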

Gene colinearity analysis

We selected the mitochondrial genomes of representative Theaceae species with accession IDs OL989850.1 [30], MH376284.1 [81], MK574876.1 and MK574877.1 [82], OP270590.1 [83], NC_067639.1, and OM809792.1; all genomic data were obtained from the NCBI GenBank database. Pairwise BLASTn comparisons of the individual mitochondrial genomes were performed using BLAST (v2.13.0) [57], and homologous sequences longer than 500 bp were retained as conserved colinear blocks. A multiple synteny plot was drawn using MCScanX [84].
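The 500 bp filter can be sketched as follows, assuming the pairwise BLASTn results were written in the default 12-column tabular format (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The output format for this step is not stated in the text, so that is an assumption, and the function name and example hits are hypothetical.

```python
def colinear_blocks(blast_tabular, min_length=500):
    """Return hits whose alignment length (column 4) is longer than min_length bp."""
    blocks = []
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        length = int(fields[3])
        if length <= min_length:
            continue
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        blocks.append({
            "query": fields[0],
            "subject": fields[1],
            "identity": float(fields[2]),
            "length": length,
            "query_range": (qstart, qend),
            # BLASTn reports sstart > send for hits on the minus strand.
            "subject_range": (min(sstart, send), max(sstart, send)),
            "strand": "+" if sstart <= send else "-",
        })
    return blocks


# Two hypothetical hits: one of 1,200 bp on the minus strand, one of 320 bp.
example = (
    "genomeA\tgenomeB\t99.1\t1200\t8\t2\t100\t1299\t5000\t3801\t0.0\t2100\n"
    "genomeA\tgenomeB\t97.0\t320\t9\t1\t4000\t4319\t700\t1019\t1e-150\t520\n"
)
for block in colinear_blocks(example):
    print(block)
```

Only the first hit is kept, because the second is shorter than the threshold. The retained blocks would then need to be converted to whatever input format the plotting tool expects.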