The structure of the majority of the flowering plant chloroplast (cp) genome consists of a pair of inverted repeats (IRs), along with large single-copy (LSC) and small single-copy (SSC) regions, and cp genome size ranges from 107 to 280 kb [1, 2]. With the emergence of next-generation sequencing technology [3], complete cp genome sequences are being extensively used to improve phylogenetic resolution at the interspecific level [4]. In addition, cp genomes have been found to contain polymorphic regions generated through genomic expansion, contraction, inversion, or gene rearrangement, and such sequences have been widely used as an effective tool for plant phylogenomic analyses [5].

The invasive species Xanthium spinosum belongs to the family Asteraceae and is within the subtribe Ambrosiinae (Heliantheae), which includes annual and perennial herbaceous plants [6]. It is native to South America and has been introduced to Canada, the United States, Central and South America, parts of Africa, the Middle East, Russia, China, Australia, and the Korean Peninsula [7,8,9,10]. The genus Xanthium has been widely used for various traditional medicinal treatments in multiple countries [11]. Parts of the X. spinosum plant are used for the treatment of cancer and diarrhea [12, 13], intermittent fever related to hydrophobia and rabies [14], and rheumatoid arthritis [15], and have antibacterial [14] and antiviral properties [14, 16,17,18]. Although several antimicrobial substances and their functions have been studied in X. spinosum over the past five decades, no exclusive genetic or genomic studies have been conducted to date.

Universal molecular markers such as the plastid genes rbcL and psbA and nuclear internal transcribed spacer (ITS) have been widely used for the rapid and precise identification of plant species but have proved unsuccessful for distinguishing very closely related species [19,20,21]. The genus Xanthium is commonly known as cocklebur, and is a close relative of the genus Ambrosia. The number of species in the genus Xanthium remains under debate, and this genus may include 5 to more than 20 species [22,23,24,25]. Phylogenetic analyses of several plastid and nuclear DNA markers have shown conflicting results for Xanthium and its relatives [11]. By contrast, Somaratne et al. (2019) used 46 cp protein-coding genes (PCGs) to resolve the phylogenetic positions of Xanthium and Parthenium and revealed that Parthenium is not an early-diverging lineage of the subtribe Ambrosiinae. However, most plant cp genomes contain highly conserved structures that are useful molecular markers for the identification of plant species in genome-wide evolutionary studies; such structures provide significant quantities of genetic information and can resolve taxonomic and phylogenetic relationships [26, 27].

In the present study, we examined both plastome evolution and the phylogenetic relationships within Heliantheae. For this purpose, we first sequenced and characterized the X. spinosum cp genome and compared it with the X. sibiricum cp genome as well as those of closely related species of Heliantheae. In addition, we identified hotspot regions of sequence variation and clarified the evolutionary dynamics among Xanthium species.


General features of the cp genome and its organization

The complete cp genome of X. spinosum was 152,422 bp in length. The cp genome showed a typical quadripartite structure containing two short inverted repeats (IRa and IRb) (25,075 bp) separated by a small single-copy (SSC) region (18,083 bp) and a large single-copy (LSC) region (84,189 bp) (Fig. 1). The cp genome encodes 115 unique genes, including 80 PCGs, 31 transfer RNA (tRNA) genes, and 4 ribosomal RNA (rRNA) genes. Six protein-coding, six tRNA, and four rRNA genes were duplicated in the IR regions. The overall GC content of the cp genome was 37.4%, while those of LSC, SSC, and IR regions were 35.4, 31.2, and 43%, respectively (Table 1).

Fig. 1
figure 1

Gene map of Xanthium spinosum. Genes lying outside the outer circle are transcribed in a counter-clockwise direction, and genes inside this circle are transcribed in a clockwise direction. The coloured bars indicate known protein-coding genes, transfer RNA genes, and ribosomal RNA genes. The dashed, dark grey area in the inner circle denotes GC content, and the light grey area indicates genome AT content. LSC, large single-copy; SSC, small single-copy; IR, inverted repeat

Table 1 Comparative analysis of chloroplast genome of X.spinosum and X. sibiricum

Comparative analyses of Xanthium species

The borders of LSC-IRb and SSC-IRa in the cp genome of X. spinosum were compared to three other closely related species of Heliantheae, namely, X. sibiricum, Ambrosia artemisiifolia, and Parthenium argentatum [28, 29] (Fig. 2). An intact copy of the rps19 gene was present in the LSC/IRb borders of X. spinosum, A. artemisiifolia, and P. argentatum, as well as a shared 95 bp to 119 bp sequence in the IRb region adjacent to the rpl2 gene. By contrast, the X. sibiricum rps19 gene was completely shifted to the LSC region, 71 bp away from the IRb region, despite the rpl2 gene being present at the LSC/IRb border. In addition, 154–175 bp of the fragmented rps19 gene in all four species was present at the IRa/LSC, LSC/IRa regions or its border. On the other hand, ѱycf1 was present in the IRa/SSC border of X. spinosum, whereas it was located in the IRb or silenced in the SSC region of X. sibiricum and A. artemisiifolia, and was situated in the SSC region of the P. argentatum cp genome. The entire ndhF gene was present in the SSC region of all four cp genomes. Similarly, an intact ycf1 gene was present in the SSC/IRa region of all of the cp genomes analyzed, except P. argentatum, which has a 565 to 583 bp fragment of ycf1 in the IRa region. However, P. argentatum encodes two copies of ѱycf1 in its genome. The trnH gene sequences are located in the LSC region 0 to 118 bp from the IRa/LSC border in all cp genomes.

Fig. 2
figure 2

Comparison of the large single-copy (LSC), small single-copy (SSC) and inverted repeat (IR) border regions of four Heliantheae (Xanthium spinosum, X. sibiricum, Ambrosia artemisiifolia and Parthenium argentatum) chloroplast genomes. Ѱ indicates a pseudogene. The figure is not drawn to scale

The cp genomic sequences of four Heliantheae species were analyzed using mVISTA software to detect variation among the sequences (Fig. 3). The sequence divergence differed markedly among regions. The data revealed that the non-coding region was more divergent than its coding counterparts. Relative to the LSC and SSC regions, IR regions of all cp genomes were less divergent.

Fig. 3
figure 3

Sequence alignment of four Heliantheae chloroplast genomes performed using the mVISTA program with Xanthium spinosum as a reference. The top grey arrow shows genes in order (transcriptional direction) and the position of each gene. A 70% cut-off was used for the plots. The y-axis indicates a percent identity of between 50 and 100%, and the red and blue areas indicate intergenic and genic regions, respectively

Repeat structure and SSR analyses

The presence of repeat sequences in the X. spinosum and X. sibiricum cp genomes was analyzed and the species were compared. Repeats in the X. spinosum cp genome consist of 264 forward, 256 palindromic, 251 reverse, and 228 complement. By contrast, X. sibiricum contained 18 forward, 15 palindromic, 6 reverse, and 2 complement repeats (Fig. 4a). In total, X. spinosum and X. sibiricum contain 999 repeats and 41 repeats, respectively. Among the 999 repeats identified in X. spinosum, repeats of 30–39 bp in length (983) were predominant in the cp genome; the longest repeat was 115 bp and was a palindrome sequence. Similarly, in X. sibiricum, 34 repeats were 30–39 bp in length, and the longest was a palindromic sequence of 177 bp (Fig. 4b).

Fig. 4
figure 4

Comparison he distribution of different repeat types in the Xanthium spinosum vs. X. sibiricum cp genomes. a The number of different types of repeats. F – forward repeats; R – Reverse repeats; P – palindromic repeats; C – complement repeats. b The length and the total number of repeat sequences present in their respective cp genomes

In total, 701 and 705 simple sequence repeats (SSRs) were identified in the X. spinosum and X. sibiricum cp genomes, respectively. The 701 SSRs in the X. spinosum cp genome included 247 (35.24%) mono-nucleotide repeats, 30 (4.3%) di-nucleotide repeats, 58 (8.3%) tri-nucleotide repeats, 67 (9.6%) tetra-nucleotide repeats, 80 (11.4%) penta-nucleotide repeats, 112 (15.98%) hexa-nucleotide repeats, 31 (4.42%) 7-nucleotide repeats, and 76 other repeats ranging from 8 nucleotides to 27 nucleotides (10.84%) (Fig. 5a). Similarly, the cp genome of X. sibiricum contained 250 (35.46%) mono-nucleotide repeats, 28 (3.97%) di-nucleotide repeats, 63 (8.94%) tri-nucleotide repeats, 74 (10.5%) tetra-nucleotide repeats, 81 (11.49%) penta-nucleotide repeats, 114 (16.18%) hexa-nucleotide repeats, 32 (4.54%) 7-nucleotide repeats, and 63 repeats with lengths from 8 nucleotides to 21 nucleotides (8.94%). Furthermore, the distributions of SSRs in the LSC, IR and SSC regions of X. spinosum and X. sibiricum indicated that the corresponding cp genomes contain 483 and 481 SSRs in the LSC, 91 and 93 in the IR, and 127 and 131 in the SSC regions (Fig. 5b). Likewise, SSRs were analyzed in the protein-coding (exon, protein-coding exon), intron and intergenic spacer (IGS) sequences of X. spinosum and X. sibiricum, which indicated that their cp genomes contain 244 and 252 SSRs in CDs, 69 and 69 in introns and 388 and 384 in IGS regions, respectively (Fig. 5c).

Fig. 5
figure 5

Comparison the presence of simple sequence repeats (SSRs) in the Xanthium spinosum vs. X. sibiricum cp genomes. a Distribution of different types of SSRs. b Presence of SSRs in the LSC, SSC, and IR regions. c Presence of SSRs in intergenic spacers, protein-coding regions, and intron regions

Nucleotide diversity analyses

The nucleotide diversity of 208 regions was analyzed using DnaSP software, including 79 PCGs and 129 intergenic and intron regions in the cp genomes of X. spinosum and X. sibiricum. The most variable region was infA (0.03) among PCGs (Fig. 6a), and high variability was observed for the trnH-psbA (0.05), psbA-trnK (0.06), trnK exon2-matK (0.09), psbI-trnS (0.05), ycf3-trnS (0.07), trnF-ndhJ (0.21), ndhC-trnV (0.13), trnV intron (0.07), petD-rpoA (0.05), infA-rps8 (0.18), rpl14-rpl16 (0.05), rpl16-rps3 (0.03), psaC-ndhD (0.09) and trnL-rpl32 (0.08) genes in introns and intergenic regions (Fig. 6b; Table 2).

Fig. 6
figure 6

The genetic diversity based on Kimura’s two-parameter model. a The P-distance value of protein-coding genes (b) the P-distance value of intron and intergenic regions (c) Boxplots of P-distance value difference among LSC, IR and SSC regions (d) Boxplots of P-distance value differences between protein-coding genes and intron and intergenic regions

Table 2 Mutational hotspots between X. spinosum and X. sibiricum

Synonymous (KS) and nonsynonymous (KA) substitution rate analyses

The synonymous and nonsynonymous substitution rates were evaluated for 79 PCGs in the X. spinosum and X. sibiricum cp genomes. The KA/KS ratios of nearly all genes were less than 1, except for the PCG accD (1.56) (Fig. 7).

Fig. 7
figure 7

Comparison the ratio of non-synonymous (KA) to synonymous (KS) substitutions of 79 protein-coding genes of Xanthium spinosum vs. X. sibiricum cp genomes

Positive selection analyses of the accD gene

Positive selection of the accD PCG in Heliantheae cp genome species was investigated using site-specific models with four comparisons (M0 vs. M3, M1 vs. M2a, M7 vs. M8, M8a vs. M8), using a likelihood ratio test (LRT) threshold of p ≤ 0.05 in EasyCodeML software. Among these models, M2a was the positive selective model and p (p0, p1 and p2) are the proportions of negative or purifying, neutral, and positive selection, respectively. The ω2 value of the accD gene was 3.70 in the M2a model. In addition, Bayes empirical Bayes (BEB) analyses were used to analyze the locations of consistent selective sites in the accD PCG using the M7 vs. M8 model comparison, and one site was found to potentially be under positive selection, with posterior probabilities greater than 0.95, while another site had probabilities greater than 0.99 (Table 3); the 2ΔLnL value was 25.91 and the p-value of LRT was 0 (Table 4).

Table 3 Comparison of site models, positive selective amino acid loci and estimation of parameters for accD protein-coding genes in the Heliantheae species
Table 4 Comparison of likelihood ratio test (LRT) statistics of positive selection models against their null models (2ΔLnL) for the accD gene

Phylogenetic analyses

In all, 71 PCGs from 21 cp genome sequences were selected for inferring phylogenetic relationships among closely related species of Heliantheae, and Ligularia fischeri (MG729822) was selected as an outgroup. A maximum likelihood tree was constructed using 71 concatenated PCGs in the cp genomes. The genus Xanthium was closely related to the genus Ambrosia (Fig. 8). Our analyses showed that Parthenium was a sister clade to both Xanthium and Ambrosia, and also an early-diverging lineage of the subtribe Ambrosiinae with a weak bootstrap value (57%).

Fig. 8
figure 8

Molecular phylogenetic tree based on 79 protein-coding genes of 21 Asteraceae chloroplast genomes. Achyrachaena mollis set as the outgroup. The tree was constructed by maximum likelihood analysis of the conserved regions using the RAxML program and the GTRI nucleotide model. The stability of each tree node was tested by bootstrap analysis with 1000 replicates. Bootstrap values are indicated on the branches, and the branch length reflects the estimated number of substitutions per 1000 sites


The single circular cp genome structure of X. spinosum was similar to that of X. sibiricum with a typical quadripartite structure and equal GC content (37.45%) unevenly distributed across the cp genome. Relative to the LSC and SSC regions, the GC content is greater in IR regions across both cp genomes, possibly due to the presence of four extremely conserved rRNA genes with high GC content in these regions. The expansion and contraction of IR regions was the main cause of variation in cp genome size, and assessing these differences could shed light on the evolution of related taxa [30, 31]. The cp IR boundary regions of X. spinosum were compared to those of closely related species, and little difference was found, except for position changes in ѱycf1. The sizes of the four cp genomes (X. spinosum, X. sibiricum, A. artemisiifolia, and P. argentatum) were not affected. Moreover, the length of each region and the total genome size were similar to those of most plant cp genomes reported previously [32].

Repeat units, which are dispersed in cp genomes at high frequency, play a significant role in genome evolution [33,34,35,36]. Our comparative analyses of X. spinosum and S. sibiricum cp genomes showed a 24.4-fold higher level of repeats in X. spinosum. An earlier study reported that variation in the number and type of repeats may play a major role in plastome organization; however, we found no correlation between these large repeat regions and rearrangement endpoints [37]. SSRs, also known as microsatellite repeats [38, 39], are common in the cp genome, and these sequences display a high level of polymorphism, supporting their use as a genetic marker in previous investigations [40, 41]. The contents of different types of SSRs and their distributions among cp regions were similar in X. spinosum and X. sibiricum. Multiple definitions of repeat motifs and repeat number within motifs have been used in the literature; our SSR definition aligns with those of Bilgen et al. [42] and Karaca et al. [43].

The cp genomes of Xanthium showed less variation in non-coding regions than in their coding counterparts. The LSC region exhibited higher divergence levels than the IR and SSC regions (Fig. 6c). Specifically, the two IR regions were least divergent, perhaps due to the presence of four highly conserved rRNA sequences in those regions. The average nucleotide diversity (π) of intergenic regions was 0.0170, almost four times as high as that of PCGs (π = 0.004195), revealing that intergenic regions show greater divergence (Fig. 6d).

Not all PCGs are phylogenetically useful for determining taxonomic discrepancies [44]. In previous studies, several plastid and nuclear DNA markers from non-coding regions have been used to resolve the phylogenetic position of Xanthium species, leading to inconsistent results [11]. Hence, the use of the additional markers and broader taxonomic sampling are required to achieve greater phylogenetic resolution at low taxonomic levels [11, 45]. Therefore, in the present study, we proposed a set of 14 divergent regions between X. spinosum and X. sibiricum to resolve taxonomic discrepancies and provide a genetic barcode for the genus Xanthium. All of these regions are intergenic spacer regions, which might be useful for the development of molecular markers to use in phylogenetic and phylogeographic studies. The 14 sequences identified in the present study are extremely polymorphic compared to the sequences used in previous studies [6, 11, 45]. Based on our data, molecular markers can be developed for these intergenic regions that may be used for phylogenetic, phylogeographic, and barcoding studies of Xanthium. Moreover, this is the first report of the development of genetic markers based on these regions and their use to distinguish among Xanthium species. In addition, the nucleotide substitution rate and BEB analyses revealed that the accD gene may be under positive selection, and other positively selected sites detected in the present study may drive the accD PCG, supporting the occupation of various habitats [46, 47]. The earlier studies indicated that the gene accD encoded plastid beta carboxyl transferase subunit of acetyl-CoA carboxylase (ACCase) which is important for the proper chloroplast and as all stages of leaf growth [48], leaf longevity [49], fatty acid biosynthesis [50, 51] and embryo development [52]. Hence, the accD gene may have been involved in adaptation to specific ecological niches during the radiation of dicotyledonous plants [53].

Over the past few years, numerous plastid genome databases have been reported, offering an important foundation for resolving evolutionary, taxonomic, and phylogenetic questions in plants [54,55,56,57,58,59,60]. Our phylogenetic analyses showed that the genus Xanthium is most closely related to the genus Ambrosia. Several previous studies have used various methods including cladistic analyses [61, 62], cp restriction site variation assessments [63], and sequence analyses [11, 64] to understand the position of Xanthium, and these have shown that it is most closely related to Ambrosia species. Previous phylogenetic studies have shown that the genus Parthenium is an early-diverging lineage of the subtribe Ambrosiinae based on three plastid and two nuclear markers. We obtained consistent results, but with weak bootstrap support (57%). Somaratne et al. [6] suggested that Parthenium is not an early-diverging lineage of the subtribe Ambrosiinae, however, their phylogenetic analysis included only 46 cp PCGs. By contrast, we analyzed 71 PCGs in the present study, and the results suggest that Parthenium is an early-diverging lineage of subtribe Ambrosiinae.


We aimed to expand the molecular genetic resources available for the species X. spinosum through high-throughput sequencing and cp genome assembly. The structural characteristics of the X. spinosum cp genome is similar to other angiosperms. However, fourteen highly variable regions were detected and suggested as potential markers for future barcoding and phylogenetic studies of Xanthium species. Hence, the sequence data for the complete X. spinosum cp genome could be used as to distinguish among Xanthium species and resolve the phylogenetic relationships within the Ambrosiinae lineage.


DNA extraction and sequencing of Xanthium spinosum

Leaf material of Xanthium spinosum was obtained from Dr. George A Yatskievych, Curator, Plant Resources Center, University of Texas Herbarium (19–056), Austin, Texas, USA. Total genomic DNA was extracted using a modified cetyltrimethylammonium bromide method [65]. Illumina sequencing was carried out by LabGenomics, Seongnam, South Korea, using the Illumina HiSeq 2500 sequencing system. A paired-end library (150 × 2) was constructed with an insert size of 350 base pairs (bp). Read quality was analyzed with FastQC v0.11.9 [66] and low-quality reads were removed with Trimmomatic 0.39 [67]. The resultant clean reads were filtered using the GetOrganelle v1.6.0 pipeline ( to obtain plastid-like reads, and then the filtered reads were assembled de novo using SPAdes v3.12.0 [68]. The complete cp genome sequence of X. spinosum and its gene annotation were submitted to GenBank (MT668935).

Annotation of X. spinosum cp genome

The online program Dual Organellar GenoMe Annotator (DOGMA) was used to annotate the cp genome sequence of X. spinosum [69]. The initial annotation, putative starts, stops, and intron positions were fine-tuned through comparison with homologous genes in the closely related species X. sibiricum [6]. Transfer RNA genes were validated using tRNAscan-SE v1.21 with the default settings [70]. The program OGDRAW v1.3.1 was employed to draw a circular map of the X. spinosum cp genome [71].

Comparative cp genome analyses

The mVISTA program, which uses the Shuffle-LAGAN model, was employed to compare the cp genome of X. spinosum with three closely related cp genomes from X. sibiricum, Ambrosia artemisiifolia, and Parthenium argentatum using the X. spinosum annotation as a reference [72]. The boundaries between IR and SC regions of these species were also compared and investigated.

Repeat sequence and simple sequence repeats (SSRs) analyses

The program REPuter was used to predict the presence of repeat sequences in the X. spinosum and X. sibiricum cp genomes, including forward, reverse, palindromic, and complementary repeats [73]. The following parameters were used to identify repeats with REPuter: Hamming distance 3, minimum sequence identity of 90%, and repeat size > 30 bp. Phobos software v1.0.6 was employed to identify SSRs in the X. spinosum and X. sibiricum cp genomes; the match, mismatch, gap, and N positions parameters were set to 1, − 5, − 5, and 0, respectively [74]. For repeat and SSR marker analyses, only one IR region was used.

Anaglyses of genetic divergence

To analyze genetic divergence, the PCGs, intergenic, and intron-containing regions of the X. spinosum and X. sibiricum cp genomes were extracted and aligned independently using Geneious Prime v2020.1.2 (Biomatters, New Zealand). Genetic divergence between these Xanthium species was calculated based on nucleotide diversity (π) and the total number of polymorphic sites using DnaSP v5.10.01 [75]. For this analysis, gaps and missing data were excluded.

Characterization of substitution rates

To calculate the synonymous (KS) and nonsynonymous (KA) substitution rates, the cp genome of X. spinosum was compared to that of X. sibiricum. Corresponding single-functional PCG exons were extracted from both genomes and aligned independently using Geneious Prime v2020.1.2 (Biomatters, New Zealand). The aligned sequences were translated into protein sequences and analyzed using DnaSP v5.10.01 to obtain KA and KS substitution rates without stop codons.

Positive selection analyses

Positive selection (M2a and M8) and control (M1a, M7, and M8a) models provided in EasyCodeML software v1.21 [76] were used to identify the occurrence of positive selection (ω > 1) on the accD locus in Heliantheae cp genomes. The sequence of the accD gene was aligned using the program MAFFT v1.4.0 [77], and the maximum likelihood phylogenetic tree was constructed using RAxML v7.2.6 [78]. The site-specific model was used to calculate nonsynonymous (KA) and synonymous substitution (KS) rates using EasyCodeML. The codon substitution models M0, M1a, M2a, M3, M7, M8, and M8a were analyzed. The likelihood ratio test was used to identify positively selected sites in comparisons of M0 (one-ratio) vs. M3 (discrete), M1a (neutral) vs. M2a (positive selection), M7 (β) vs. M8 (β and ω > 1) and M8a ((β and ω = 1) vs. M8 using a site-specific model [76]. The likelihood ratio test (LRT) for these comparisons was used to evaluate the selection strength and p-values of less than 0.05 from the chi-square (χ2) test were considered significant. If the LRT p-values were significant (< 0.05), the Bayes Empirical Bayes (BEB) method was implemented to identify codons under positive selection. BEB values higher than 0.95 and 0.99 indicate sites that are potentially under positive selection and highly positive selection, respectively.

Phylogenetic tree analyses

A phylogenetic tree was constructed using 71 PCGs from 21 Asteroideae cp genomes, with L. fischeri as the outgroup. A total of 20 complete cp genome sequences were downloaded from the NCBI Organelle Genome Resource database. The aligned PCG sequences were saved in PHYLIP format using Clustal X v2.1 [79], and phylogenetic analyses were conducted based on the maximum likelihood (ML) method and the GTRI model using RAxML v7.2.6 with 1000 bootstrap replications [78].