Background

Lycium is a crucial shrub genus in the Solanaceae family, consisting of approximately 70 species found globally across Southern Africa, Europe, Asia, America, and Australia [1, 2]. However, China hosts 15 species, including subspecies, primarily growing in the arid and semi-arid regions of Ningxia, Xinjiang, Inner Mongolia, and other areas [3,4,5,6,7]. These plants have been utilized as food and herbal medicine in China for millennia [8, 9], with various parts like fruits, leaves, root bark, and young shoots used as local foods and/or medicines [10]. These plants impart potential pharmacological effects such as anti-aging properties, reduction of blood glucose and serum lipids, and immune regulation [11]. Several studies have demonstrated that extracts of these plants can prevent and treat diseases such as night sweats, diabetes, cough, vomiting, hypertension, and ulcers [10, 12, 13]. Overall, previous studies have focused on the growth, development, medicinal value, and breeding of Lycium [14,15,16,17]. Despite its economic and ecological significance, much remains unknown about Lycium phylogeny, interspecific relationships, and evolutionary history. While numerous studies have identified and analyzed phylogenetic relationships within the Lycium genus using DNA barcode fragments [18,19,20,21,22], research on CPG phylogeny and lineage diversification is lacking.

Chloroplasts are essential organelles involved in photosynthesis and carbon fixation in plant cells. The advancement in sequencing technologies has made whole CP sequencing more accessible [23,24,25]. Compared with DNA fragments, CPGs contain significantly more informative sites for analyzing nucleotide diversity and reconstructing phylogenies among closely related species [26,27,28]. CPGs typically range from 75 to 250 kb in length, with numerous copies in a given cell, are maternally inherited in most plants, and have conserved gene content and order [29, 30]. The plastome is characterized by two large inverted repeat regions (IRa and IRb) separated by two single-copy regions, referred to as the large single-copy region (LSC) and the small single-copy region (SSC) [31]. Recently, CPGs have been used for comparative and phylogenetic analyses, proving useful in species identification, genetic diversity assessment, nucleotide diversity assessment, resolving phylogenetic relationships, and evolutionary history [32,33,34,35,36].

Despite the use of CPGs, few molecular phylogenetic studies have attempted to resolve infrafamilial relationships within Lycium using broad taxon sampling. Most of these studies were based on one or a few molecular loci or had sampling limitations, leaving our understanding of the phylogenetic relationships and divergence timescales of the genus unclear [18, 37,38,39]. In this study, we sequenced and aligned the CPGs of 15 species, including subspecies, of the genus Lycium from China. Our main objectives were to (a) construct a phylogeny based on the CPGs of 36 species from the Solanaceae family; (b) date the divergence of the Lycium clade; and (c) examine structural changes in the CPGs of the sampled Lycium species.

Results

Characteristics of Lycium chloroplast genomes

After quality control and pre-processing, a minimum of 6 Gb of whole-genome sequencing data were obtained for each of the 16 species included in this study (Table S1). These clean reads were used to assemble complete CPs using a reference-guided approach. All the newly assembled CPGs displayed a typical quadripartite structure, with two IR regions separating the LSC and SSC regions (Fig. 1).

For each of the 15 Lycium species, the CPGs size ranged from 154,890 bp (L. changjicum) to 155,677 bp (L. amarum) (Table S1). All the CPGs had a typical quadripartite circular structure (Fig. 1) consisting of an LSC and an SSC region separated by a pair of IR regions (Fig. 1 and Table S1). The LSC region’s length varied from 85,892 bp (L. changjicum) to 86,635 bp (L. amarum), and the lengths of the SSC and IR regions ranged from 18,190 bp (L. cylindricum) to 18,215 bp (L. barbarum) and 25,394 bp (L. ruthenicum) to 25,469 bp (L. cylindricum), respectively (Table S1). The GC content of the entire plasmid sequence and the LSC, SSC, and IR regions was similar across all Lycium CPGs. Specifically, the GC content of the entire plasmid sequence was 37.8–37.9%; while the GC content of the IR regions was 43.1–43.2%, which was higher than that of the LSC and SSC regions (35.8–35.9% and 32.3–32.4%, respectively; Table S1). Additionally, the number of annotated genes in each CPGs ranged from 128 (L. amarum) to 132 (L. ningxiaens), and included 35–37 tRNA and 8 rRNA genes.

Fig. 1
figure 1

Gene map of the CPGs of fifteen Lycium species. Genes belonging to different functional groups are shown in different colors. The darker gray area in the inner circle indicates the GC content and the lighter gray indicates the AT content of the genome. The thick lines indicate the extent of the inverted repeats (IRa and IRb) that separate the genomes into the small single-copy (SSC) and large single-copy (LSC) regions

Comparative genomics and divergence hotspots

Using L. chinense as a reference, the CPGs of the 15 Lycium species were visually compared with those obtained using the mVISTA online database. The results showed that the CPGs of the 15 species were conserved, especially the coding regions (Fig. 2). Considering the entire plastome, the SSC and LSC regions displayed marked divergence compared to the IR regions. The proportion of non-coding regions was greater than that of protein-coding regions, and divergence hotspots were largely located in the intergenic spacer regions (Fig. 2). These divergence hotspots usually consist of highly variable sequences that can be used as potential DNA barcodes for phylogenetic analyses and to determine relationships between species. Therefore, to further understand DNA polymorphisms (Pi), mutation hotspot regions in the CPGs of the 15 Lycium plants were screened using DnaSP software (Fig. 3). Pi analysis revealed that Pi values ranged from 0 to 0.00851, and the CPGs were relatively structurally conserved, small, and highly variable among the species. We identified five mutation hotspot regions (Pi > 0.0006) that could be used as potential molecular markers. Of these, trnR(UCU), atpF-atpH, ycf3-trnS(GGA), and trnS(GGA) are located in the LSC region, while trnL-UAG is located in the SSC region (Fig. 3). None of the mutation hotspots are located in the IR region.

Fig. 2
figure 2

Sequence alignment of the CPGs of Lycium species. The alignment was performed using the mVISTA program and the L. chinense chloroplast genome was used as a reference. The Y-axis: the degree of identity ranging from 50 to 100%. Coding and non-coding regions were marked in blue and red, respectively. Black arrows indicated the position and direction of each gene. CNS: conserved non-coding sequences

Fig. 3
figure 3

Sliding window test of nucleotide diversity (Pi) in the multiple alignments of 15 Lycium species (window length: 600 bp; step size: 200 bp). X-axis: the position of the midpoint of the window; Y-axis: the nucleotide diversity of each window

Fig. 4
figure 4

Comparisons of the borders of the large single-copy (LSC), small single-copy (SSC), and inverted repeat (IR) regions among the CPGs of fifteen Lycium species

Boundaries between IR and SC regions

The boundaries of the LSC, SSC, and IR regions were highly consistent within Lycium and no obvious expansion or contraction of the IR region was detected in the 15 CPGs (Fig. 4). Here, trnH was shown to be the first gene in the LSC region at the junction between IRa and LSC (i.e., IRa/LSC). At the other end of the LSC region, the IRb/LSC junctions were located in the rps19 gene sequence in all Lycium species, with the length of the rps19 gene located in the IRb region, varying from 46 to 50 bp. At both ends of the SSC region, IRb/SSC junctions were located between the ycf1 and ndhF genes. However, SSC/IRa was found in the ycf1 gene, with the length of the ycf1 gene located in the IRa region varying from 995 to 1004 bp.

Phylogenetic analyses of Solanaceae

To infer the phylogenetic relationships of the 36 Solanaceae species, we included two Convolvulaceae species whose CPGs are publicly available in the GenBank database. These species (Calystegia hederacea and Convolvulus arvensis) were used as the outgroup for phylogenetic analyses. The final concatenated dataset included 55 plastid genes and 42,960 sites, of which 2992 were parsimony informative, after trimming poorly aligned regions and gaps with missing genes (Table S2). In the phylogenetic trees, most nodes in the ML and BI analyses of each dataset generated almost congruent topologies with high bootstrap support (BS > 90%) and posterior probability (PP = 1) values (Fig. 5 and S1, S2, S3, S4). Thirty-six plant species belonging to the Solanaceae family were divided among three typical branches. Branches of Physochlaina physaloides, Przewalskia tangutica, Scopolia carniolica, Atropanthe sinensis, Solanum betaceum, Anisodus acutangulus, Hyoscyamus niger, and Atropa belladonna were resolved as sisters of Lycium. Further, in the CDS phylogeny, Lycium species were clustered on one large branch, confirming that the independence of this genus is highly supported (BS; PP = 100%, 1) (Fig. S1 and S2). The 15 species of plants belonging to Lycium were divided into two typical sub-branches and three minor branches: clade I-1, comprising L. barbarum var. auranticarpum, L. cylindricum, L. yunnanense, L. barbarum, L. dasystemum var. rubricaulium, and L. dasystemum on one sub-branch (BS; PP = 100%, 1); clade I-2 consisting of L. chinense var. potaninii, L. truncatum, L. amarum, and L. chinense on another sub-branch (BS; PP = 100%, 1), and clade II comprising L. barbarum var. implicatum, L. qingshuiheense, L.ningxiaense, L. changjicum, and L. ruthenicum on a different branch (BS; PP = 100%, 1).

Fig. 5
figure 5

Phylogeny and clade divergence of Solanaceae and outgroups based on 55 plastome protein-coding genes. Stars indicate fossil and time constrations in this analysis. Geological periods are marked with background colors. Mya: million years ago; Pal: Paleocene; Pli: Pliocene; Ple: Pleistocene

Table 1 Estimated ages for Lycium subclades

Divergence time estimation

We estimated the divergence timescales of the major clades within Lycium according to the calibration of the gene tree constructed based on 55 plastid genes. The split between Lycium and its sister group was estimated to have occurred 17.7 Mya. The crown ages of all subclades in the genus Lycium were dated mainly within the Pleistocene, suggesting that numerous species of this genus originally diversified in the recent past (6 Mya) (Fig. 5; Table 1).

Discussion

Recent several studies have reported that most angiosperms have CPGs ranging in size from 120 to 170 kb, with the IR region typically spanning 20–30 kb [16, 27, 28]. In the present study, a comparative analysis of the CPGs indicated that those of 15 Lycium species from China were at the larger end of the spectrum for angiosperm organelle genomes, ranging from 154,890 bp (L. changjicum) to 155,677 bp (L. amarum). All Lycium species exhibited a typical quadripartite structure that is similar to other vascular plants, comprising the LSC region (85,892–86,635 bp), SSC region (18,190–18,215 bp), and two identical IR regions (25,394–25,469 bp). The GC content of the entire plastid sequence and the LSC, SSC, and IR regions was similar across all Lycium CPGs. Consistent with the results of numerous studies, the GC content of the IR region is relatively the highest among the four chloroplast genome regions [27], potentially explaining the conservation of the IR region [28, 29]. However, the IR partition shrinkage and expansion of the edge were the main factors of the size change, also enabling easy pseudogene production [34, 35]. In our study, we analyzed 15 CPGs within the highly conserved Lycium species and noticed that no major expansions or contractions occurred in the IR regions.

Highly variable regions offer valuable phylogenetic information. For example, variable regions aid can be used as development of molecular markers and species delimitation [38,39,40,41]. A good molecular marker must be a short, representative DNA fragment with high variability and amenable to amplification [42]. In this study, both the sequence and structure of Lycium CPGs were highly conserved. mVISTA analysis revealed that most of the variation in nucleotide sequences occurred in noncoding regions, consistent with previous reports, suggesting this variation as a common feature of angiosperms [43,44,45]. In the Lycium, several highly variable regions, such as matk, rps16–trnK and trnH–psbA, are recognized as potential DNA barcoding sites [39, 46]. In addition, our nucleotide diversity (Pi) analysis led to the identification of five highly variable regions with Pi values greater than 0.006, including three non-coding tRNA regions (trnS, trnL, and trnR) and two intergenic regions (atpF-atpH and ycf3-trnS (GGA). In conclusion, these mutation hotspot regions play an important role in the identification and characterization of Lycium plant species.

Lycium is a genus of shrubs with significant economic and ecological importance. Therefore, considerable efforts have been dedicated to cultivated practices, growth and development, medicinal application, and breeding [10,11,12,13,14,15,16,17]. However, to date, few studies have explored its phylogenetic relationships and divergent timescales [11]. The CPGs are central to molecular biology research and have become a prominent focus. In recent years, due to advances in sequencing technology and reduced costs, whole-genome sequencing has become an important tool for species identification, phylogenetic relationships, and evolutionary history reconstruction. Single- or several-gene-segment-based phylogenetic trees often produce inconsistent or even conflicting topologies due to the evolutionary rates and gene introgression. This complexity affects precise evolutionary relationship determination between species [47, 48]. In this study, we constructed a phylogenetic tree using the BI and ML methods. The CPGs of 15 Lycium species converged into branches with high support (Fig. S3 and S4). Notably, L. cylindricum, L. yunnanense, L. dasystemum, and L. dasystemum were similar to L. barbarum and closely related, while L. chinense was classified in the same branch as L. truncatum and L. amarum (BS; PP = 100%, 1.00, respectively), indicating that the three species are similar. Based on the phylogenetic tree results, we speculate that L. qingshuiheense, L. ningxiaense, and L. changjicum may have originated from L. ruthenicum. Although L. barbarum var. implicatum and L. qingshuiheense formed a small, distinct branch, the support rate was low, indicating that they may be the same species. This study demonstrates that high-resolution CPG sequences provide valuable resources for broad research on genetic information and species identification of Lycium spp.

The highly conserved and stable alignment of the 55 plastid genes allowed us to calibrate the divergence and origin of Lycium (Fig. 5). We used three tentative calibrations to estimate the diversification. While the estimated ages should be used with caution, our findings indicate that the Lycium diverged from the sister genus around 17.7 Mya, and the two successive clades within the Lycium diverged 4.95 and 1.7 Mya, suggesting relatively late clade diversifications. Specifically, most species diversification within the subclades of Lycium, as estimated from these plastid genes, appeared to have occurred in the recent past, mostly after 5 Mya., despite the fact that numerous species are currently acknowledged in genera. This observation may partly account for the widespread hybridization observed among these young species [46], likely resulting from incomplete reproductive isolation. The divergence timescales estimated here for the major subclades will serve as a basic timescale for diverse studies on this economically and ecologically important genus.

Conclusions

We analyzed the complete CPGs of 15 Lycium species and found that all exhibited a quadripartite structure, typical of most angiosperms. The CPG arrangement was highly conserved, with great sequence variation observed in the SSC region compared with that in the IR region. We did not detect major expansions or contractions in the IR region, nor did we find any rearrangements or insertions in the CPGs of the 15 Lycium species. We identified highly variable regions within Lycium that are likely to be useful for species delimitation. Phylogenetic tree construction based on the CPGs showed that all 15 Lycium species formed a monophyletic group, divided into two typical subbranches and three minor branches. Molecular dating suggested that Lycium diverged from its sister genus around 17.7 Mya, with species diversification of the Lycium species of China occurring mainly within the recent Pliocene epoch. Overall, our findings and the estimated divergence times will facilitate future studies on Lycium, assist in species differentiation, and facilitate diverse studies of this economically and ecologically significant genus.

Materials and methods

Taxon sampling, DNA extraction, and plastome sequencing

A total of 38 CPGs representing Solanaceae and related families were included in this study (Table S3). 36 CPGs from Solanaceae were selected, including all Lycium species found in China. Further, two additional CPGs from related families in Convolvulaceae were chosen as outgroups for phylogenetic analysis. Among these 38 CPGs, 16 complete CPGs were newly sequenced, and the others were obtained from GenBank (Table S3). The leaves used in this study were collected from natural populations in China (Table S3). The plant materials were identified by Dr. Lei Zhang and the voucher specimens (Table S2) were deposited in the Herbarium of North Minzu University (NMU; Yinchuan, China). For each species, we extracted total DNA from dried leaves preserved them in silica gel using the CTAB protocol [49]. Paired-end libraries with an insert size of 500 base pairs (bp) were constructed by Illumina (Qingdao, Shandong, China) following sequencing with a HiSeq × Ten System (Jizhi, Qingdao, Shandong, China).

Chloroplast genome and annotation

At least two gigabases (Gb) of 2 × 150 bp short read data were generated for each sample. Reads with quality scores of less than 7 and with more than 10% ambiguous nucleotides were filtered. The remaining reads were assembled using NOVOPlasty version 2.7.2 [50] software with k-mer = 39, read length = 150 and insert size = 350. The contigs were aligned into sequence in Geneious version 9.1.8 [51] software using the L. chinense CPG as a reference. The CPGs were annotated using Plann version 1.1 [52]. Protein-coding genes were extracted using customized Python scripts. Alignment of chloroplast genes across all species was performed using PRANK version 130,410 [53] software using the codon model. Poorly aligned regions were trimmed using Gblocks version 0.91b [54] with the option “−t = c,” selected to set the type of sequence to codons. Genes that were absent in at least one species were excluded, and the aligned sequences were combined into a super matrix. Additionally, circular maps of the CPGs were created using OGDRAW version 1.2 [55], and all annotated CPGs were submitted to GenBank [56].

Comparative genomics and structural analyses

The structural variation and identification of arrangement events across Lycium was conducted for the 15 CPGs of Lycium. The results of the comparative analysis of the CPGs were visualized with the mVISTA program [57] and the annotated CPG of L. chinense was used as the reference in the LAGAN mode [58]. The junction sites of the four structural regions (IRA, LSC, SSC, and IRB) and adjacent genes in 15 Lycium CPGs were visualized and compared using IRscope [59] software to obtain a macroscopic view of the CPG structure. Following sequence alignment, nucleotide diversity (Pi) analysis of the CPG was performed using DnaSP version 6.0 [60].

Phylogenetic inference and divergence time estimation

We generated two datasets for phylogenetic analysis: a protein-coding region (CDS) set and a whole plastome (WP) set. Protein-coding genes (PCGs) were extracted from the GenBank formatted file containing 38 CPGs using customized Perl scripts that removed the start and end codons. After excluding possible pseudogenes, 55 PCGs were retained in all species. Each PCG was aligned using PRANK version130410 based on the translated amino acid sequences. Genes that had been lost in at least one species were discarded and then the remaining aligned sequences were concatenated into a super matrix. Independent phylogenetic analyses were performed for each dataset (CDS and WP) using the maximum likelihood (ML) and Bayesian inference (BI) methodologies. We used RAxML version 8.1.24 [61] to conduct ML analyses with a general time reversible model with a gamma distribution (GTR + Γ). The best-scoring ML tree was obtained using the rapid hill-climbing algorithm (i.e., the option “-f d”) with 1,000 bootstrap replicates. The optimal model (GTR + I + G) was identified using jModeltest software, and BI analysis was conducted using MrBayes version 3.2.6 [62]. Additionally, FigTree version 1.4.2 [63] was used to visualize phylogeny.

We estimated divergence times from the plastome dataset using an approximate likelihood method, as implemented in MCMCtree in PAML version 4 [64] software, with independent relaxed-clock and birth–death sampling [65] strategies. Fossil dates and time constrations were used as calibration points to reduce bias for more accurate age estimates [66]. However, the fossil records of Solanaceae are very limited. In order to estimation the divergence time of Lycium, we choosed one fossil and two other time constrations to calibrate the phylogeny: (1) According to the ancient seed fossil of Solanum nigrum [67], 5.3–11.6 Mya was assigned to the split between Solanum nigrum and its sister species Tubocapsicum anomalum. (2) The split between Solanum and Lycium was assigned an age range of 19.0-23.3 Mya as previously estimated [68]. (3) The root of the phylogeny was restricted to an age range of 46.2–53.7 Mya based on the secondary age constraints described by Särkinen et al. [68]. The best-fit GTR + Γ model was selected and the prior of the substitution rate (rgene) was modeled by a Γ distribution as Γ (2, 200, 1). We set parameters for the birth–death process with species sampling and σ2 values of (1, 1, 0.1) and G (1, 10, 1), respectively. We executed the MCMC runs for 2,000 generations as burn-in and then sampled every 750 generations until 20,000 samples were obtained. We compared two MCMC runs for convergence using random seeds and obtained similar results.