Background

Amomum tsaoko is a perennial evergreen tufted herb of Zingiberaceae; the whole plant has a spicy taste. The dried fruit (also called Cao-Guo in China) of A. tsaoko is an important crude drug in traditional Chinese medicine (TCM), such as ‘Cao-Guo-Zhi-Mu-Tang,’ ‘Da-Yuan-Yin,’ ‘Cao-Guo-Si-Wei-Tang,’ and ‘Li-Gan-Shi-Liu-Ba-Wei-San,’ which clear dampness, resolve phlegm, warm the spleen, and dispel colds. In addition, A. tsaoko is also a top-grade spice known as one of the “five spices” in food seasoning [1]. In recent years, A. tsaoko has been considered to have a broader utilization value and proven to have biological activities, such as anti-oxidation [2, 3], antibacterial [4], anti-inflammation [5], antidiabetic [6], anti-tumor [7], and anticonvulsant properties [8]. In addition, A. tsaoko is a TCM that has been prescribed for the treatment of COVID-19 [9,10,11,12]. As an aromatic Chinese herbal medicine used for both medicinal and edible purposes, essential oil is the most important active ingredient of A. tsaoko, and its content determines the quality of A. tsaoko (Pharmacopoeia of the People’s Republic of China, 2020). Recent studies have shown that A. tsaoko essential oil has a significant inhibitory effect on COVID-19 [13]. The monoterpene 1,8-cineole was found to be the major constituent (34.6–45.24%) of the essential oil in A. tsaoko, and has well-known antiviral, anti-inflammatory, antimicrobial, and pain-relieving effects [14]. In recent years, the cloning and expression analysis of functional genes involved in terpenoid biosynthesis has become a popular topic of research. He et al. [15] used transcriptome sequencing to identify a number of terpenoid synthesis-related genes in Amomum villosum, including five monoterpene synthase genes AvTPS1AvTPS5. However, until now, there have been no reports about regulated genes that are involved in the terpenoid biosynthesis of A. tsaoko.

A. tsaoko is a high-altitude medicinal crop that grows in humid forests, narrowly distributed in the southern Yunnan Province of China and northern Laos and Vietnam at high altitudes between 1100 and 1800 m in mountainous regions [16, 17]. Due to excessive harvesting and the destruction of the original habitat of A. tsaoko, wild resources are almost extinct, and it was listed as a “Nearly Endangered Species” on the IUCN Red List in 2012. Genetic diversity evaluation can provide important reference information for the identification and evaluation of germplasm resources and the selection of elite germplasm. Molecular marker technology is an effective tool for analyzing plant genetic diversity, which has the advantages of abundant quantity, high polymorphism, direct expression in the form of DNA, and is not affected by the environment [18, 19]. SSRs, also known as microsatellite DNA, are short, repeated DNA sequences present throughout the genome. It has the advantages of good reproducibility, codominance, abundant polymorphisms, and easy detection [20]. In recent years, SSR markers have been widely used in genetic diversity analysis [21, 22], linkage genetic map construction and QTL identification [23, 24], and marker-assisted breeding [25, 26].

As a TCM and spice, previous studies of A. tsaoko mainly focused on active component extraction, identification of chemical components, and pharmacological effects [27,28,29,30,31,32]; however, molecular studies of A. tsaoko are lacking. This limits the excellent germplasm selection and breeding utilization of A. tsaoko. Some researchers have begun to study the genetic diversity of A. tsaoko at the phenotypic and molecular levels. Zhang et al. [33] analyzed the phenotypic characteristics of A. tsaoko in nine producing areas of China. The results showed that the number of fruit ridges, the number of seeds per fruit, and the vertical diameter of fruit had the greatest variation. Genetic diversity and population genetic differentiation of A. tsaoko from eight different producing areas were analyzed by 12 RAPD markers [34]. More recently, the genetic diversity of 91 A. tsaoko accessions from southwest China was studied by SRAP and ISSR markers [17]. A few reports on SSR development and the genetic diversity of A. tsaoko have recently been published [1, 35], but the number of markers is insufficient to conduct comprehensive genetic studies.

With the rapid development of next-generation sequencing technology and lower sequencing costs, large-scale RNA-seq provides an important information for functional gene mining and molecular marker development in non-model species [36, 37]. Here, we present the first transcriptome of A. tsaoko using the Illumina HiSeqTM 4000 sequencing platform. The obtained transcriptome data advance our understanding of the function categories from the annotated genes on this species, and the development of EST-SSR markers will provide an important basis for germplasm evaluation, genetic diversity analysis, and molecular breeding of A. tsaoko.

Results

Sequencing and de novo assembly

In this study, 59,876,622 raw reads were obtained using the Illumina HiSeq 4000 platform. After strict quality control, 97.33% clean reads (58,278,226) were obtained with 96.28% Q20 and 94.25% Q30 bases. A total of 199,191 transcripts were obtained by de novo assembly of clean reads with Trinity software. After clustering transcripts and removing redundancy, 146,911 unigenes were obtained, with an average length of 1527 bp and an N50 value of 2002 bp (Table 1).

Table 1 Characterization of Amomum tsaoko transcripts

The assembled unigenes of A. tsaoko were functionally annotated against seven public databases (Table 2, Fig. 1A). A total of 123,420 unigenes (84.01%) were annotated successfully in the NR database, while 95,786 (65.20%) were annotated in the NT database. At the same time, 21,543 unigenes were annotated in all databases, and 128,174 unigenes (87.24%) were successfully annotated with at least one database. The species distribution showed that 80.4% matched Musa acuminata, and 5.6% matched Elaeis guineensis. The matching degrees of Phoenix dactylifera, Nelumbo nucifera, and Vitis vinifera were 4.5, 0.6, and 0.5%, respectively, and 8.3% for other species (Fig. 1B).

Table 2 Functional annotation of A. tsaoko in seven databases
Fig. 1
figure 1

Functional annotation (A) and species distribution (B) of unigenes in Amomum tsaoko

Furthermore, for the KOG classification, 36,621 putative unigenes of A. tsaoko were classified into 25 clusters (Fig. 2). Among these categories, ‘general function prediction only’ (5039; 13.76%); ‘posttranslational modification, protein turnover, chaperones’ (4993; 13.63%); and ‘translation, ribosomal structure and biogenesis’ (3332; 9.10%) were the dominant groups, while only a few unigenes were annotated as ‘cell motility’ (35; 0.09%) and ‘extracellular structures’ (26; 0.07%). Furthermore, there were 89,750 unigenes categorized into three main GO categories: biological processes (233,859; 47.53%); cellular components (145,239; 29.52%); and molecular functions (112,917; 22.95%) (Fig. 3). Within the three categories, ‘binding’ (53,980), ‘cellular process’ (52,634), ‘metabolic process’ (49,492), ‘catalytic activity’ (41,468), and ‘single-organism process’ (37,122) were the most prevalent. Following searches against the KEGG database, 53,059 unigenes were classified into five categories, which were distributed in 130 metabolic pathways. The largest category was comprised of ‘metabolism’ (21,001; 53.69%), followed by ‘genetic information processing’ (10,413; 26.62%), ‘cellular processes’ (2887, 7.38%), ‘organismal systems’ (2559, 6.54%), and ‘environmental information processing’ (2255, 5.77%) (Fig. 4). In addition, 96,740 (65.84%) and 89,573 (60.97%) unigenes matched the SwissProt and PFAM databases, respectively.

Fig. 2
figure 2

Eukaryotic clusters of orthologous groups (KOG) classification of unigenes. A total of 36,621 unigenes with NR hits were grouped into 25 functional groups. The y-axis indicates the number of unigenes in a specific functional cluster. The x-axis indicates the function class

Fig. 3
figure 3

Gene ontology (GO) classification of unigenes. A total of 89,750 unigenes were categorized into three main categories: biological process, cellular component, and molecular function. The x-axis indicates the subgroups in GO annotation, while the y-axis indicates the percentage of specific categories of genes in each main category

Fig. 4
figure 4

Clusters of orthologous groups based on KEGG classification. A Cellular process; B Environmental information processing; C Genetic information processing; D Metabolism; E Organismal systems

Importantly, KEGG analysis showed that 496 unigenes were annotated as being directly involved in the metabolism of terpenoids, including the terpenoid backbone (338 unigenes), monoterpenoid (22 unigenes), diterpenoid (81 unigenes), and sesquiterpenoid and triterpenoid (55 unigenes) biosynthesis pathways (Table 3). In the monoterpenoid biosynthesis pathway, two unigenes encoding alpha-terpineol synthase, 15 unigenes encoding neomenthol dehydrogenase, one unigene encoding 1,8-cineole synthase, and four unigenes encoding linalool synthase were significantly enriched. The FPKM values of Cluster-26,586.35631, Cluster-26,586.48848, Cluster-26,586.69915, Cluster-26,586.65004, Cluster-18,694.1, and Cluster-26,586.96889 were greater than 10 (Fig. 5A). Notably, 1,8-cineole synthase (Cluster-24,134.3) was initially identified in this species. Multiple sequence alignment revealed that Cluster-24,134.3 contained the DDXXD motif, which is conserved in angiosperm monoterpene synthases (Fig. 5B). Cluster analysis showed that Cluster-24,134.3 had the highest homology with the monoterpene synthase gene AvTPS2 in Amomum villosum (Fig. 5C).

Table 3 Overview of the terpene biosynthetic pathways
Fig. 5
figure 5

Unigene analysis of potential monoterpene metabolic pathway in A. tsaoko. Expression of candidate unigenes related to monoterpene synthesis of A. tsaoko (A). Alignment of amino acid sequences of Cluster-24,134.3 (putative 1,8-cineole synthase unigene) with other TPS proteins from Arabidopsis thaliala (AtTPS-Cin), Salvia officinalis (SoTPS-Cin), and Amomum villosum (AvTPS1-AvTPS4) using DNAMAN 8.0 (B). Phylogenetic tree constructed based on amino acid multiple sequence alignment using MEGA 5.2 with the neighbor-joining method (C), dark blue branches represent genus Amomum species (A. tsaoko and A. villosum), and red branches represent other species (A. thaliala and S. officinalis)

Development of EST-SSR markers

SSR loci were identified within the A. tsaoko transcriptome using the MISA software. Among the 146,911 non-redundant unigenes, we identified 55,590 potential SSRs, of which 3794 had a compound formation. There were 10,465 unigenes containing more than one SSR. An average of one SSR site was found every 1.53 kb. The largest fraction of SSRs were mononucleotide SSRs (26, 742, 48.11%), followed by trinucleotide repeats (13,849; 24.91%), dinucleotide repeats (12,716; 22.87%), tetranucleotide repeats (1169; 2.10%), hexanucleotide repeats (566; 1.02%), and pentanucleotide repeats (548; 0.99%) (Fig. 6, Table 4). Among the different SSR repeat-type classes, the most dominant repeat motifs were A/T (25,819; 46.44%), AG/CT (7554; 13.59%), AT/AT (3971; 7.14%), AGG/CCT (3390; 6.10%), AAG/CTT (2709; 4.87%), and CCG/CGG (2330; 4.19%). The remaining motif types accounted for 17.67% of these repeats (Table 4). The tandem repeat numbers of these SSRs ranged from 5 to 86, and 10 tandem repeats (12,910; 23.22%) was the most common number, followed by 5 (7789; 14.01%), 6 (6749; 12.14%), 11 (5655; 10.17%), 12 (3975; 7.15%), and 7 (3792; 6.82%) (Table 5).

Fig. 6
figure 6

Frequency and distribution of EST-SSRs from the transcriptome of A. tsaoko. The most abundant dinucleotide and trinucleotide motifs were AG/CT and AGG/CCT

Table 4 Distribution of EST-SSR loci in the transcriptome of A. tsaoko
Table 5 The length distribution of EST-SSRs based on the number of nucleotide repeat units

Validation of EST-SSR markers

A total of 42,333 primer pairs were successfully designed based on SSR flanking sequences. Four accessions of A. tsaoko were used for primer amplification specificity and efficiency testing. Eighty pairs of primers were randomly selected for synthesis; 64 of the EST-SSRs successfully amplified DNA fragments with the expected band sizes. Eventually, 18 primer pairs yielding the best amplification were chosen to analyze the genetic diversity of 72 A. tsaoko accessions from six cultivated populations (Table 6). SSR amplification profiles of primer pairs AM242, AM272, AM273, and AM278 are shown in Fig. 7. A total of 98 bands were scored, with an average of 5.4 bands per primer and a range of 3 (AM213) to 12 (AM255). The average effective number (Ne) of alleles per locus was 2.961, ranging from 1.311 (AM208) to 4.852 (AM242). The Shannon' information index (I) ranged from 0.477 (AM208) to 1.701 (AM242) with a mean of 1.183. The observed heterozygosity (Ho) ranged from 0.264 (AM208) to 1.000 (AM225) with a mean of 0.594, and the expected heterozygosity (He) ranged from 0.237 (AM208) to 0.794 (AM242) with a mean of 0.613. The value of PIC ranged from 0.223 (AM208) to 0.779 (AM247) with a mean value of 0.580. The HWE analysis showed that six loci (AM224, AM225, AM237, AM247, AM248, and AM278) significantly deviated from HWE (P < 0.05), and the remaining 12 loci (AM203, AM206, AM207, AM208, AM213, AM218, AM223, AM242, AM255, AM272, AM273, and AM279) were in accordance with HWE (Table 7). The frequencies of null alleles were low (< 0.20) for each locus, except AM247.

Table 6 Characterization of 18 EST-SSRs among 72 A. tsaoko accessions
Fig. 7
figure 7

Electrophoretic profiles of 48 A. tsaoko accessions using four pairs of EST-SSR primers. Denatured PCR products were separated using an 8% non-denaturing polyacrylamide gel and then stained with 1% silver nitrate

Table 7 Allelic diversity of 18 EST-SSR markers used in 72 accessions of A. tsaoko

Population genetic diversity

At the population level, the YY population scored higher values in Ne, I, and He, while Na and Ho were higher in the LVC and LC populations, respectively. Notably, the number of private alleles (PAr) in the JP and LVC populations (0.167) was higher than that in the PB, YY, LC, and YX populations. Mean fixation index (F) values ranged from − 0.145 in the YX population to 0.015 in the YY population, with an average of − 0.066 (Table 8). The samples did not cluster according to the six different sampling sites based on the UPGMA cluster and PCoA, and all samples from different populations were not separated from each other (Fig. 8). The studied populations showed relatively low between-population genetic differentiation (Fst average = 0.052; min = 0.029, max = 0.091) and small genetic distance (GD average = 0.155, min = 0.090, max = 0.263) (Table 9). Consistent with these results, AMOVA also showed that 90% molecular variance was found within populations (Table 10).

Table 8 Genetic diversity of A. tsaoko populations based on EST-SSR markers
Fig. 8
figure 8

Cluster analysis and principal coordinate analysis (PCoA) based on the genotyping of 72 accessions of A. tsaoko

Table 9 Pairwise comparison of FST (above diagonal) and genetic distance (below diagonal) for six A. tsaoko populations
Table 10 Analysis of molecular variance (AMOVA) in A. tsaoko from six populations

Cross-species transferability

To test the transferability of the novel EST-SSR markers in different Zingiberaceae species, 18 polymorphic EST-SSR markers were tested for amplification in 11 other species. Fourteen primer pairs were successfully amplified, and the transferability rates (TR) ranged from 27.78% (Alpinia coriandriodora) to 77.78% (Alpinia zerumbet and Curcuma phaeocaulis), with an average of 68.18% (Fig. 9A). Furthermore, UPGMA phylogenetic analysis based on transferable SSR markers showed that 12 Zingiberaceae species clustered into two groups (Fig. 9B): the first class (the green shaded area indicating the genera Amomum and Alpinia) and the second class (the red shaded area indicating the genera Kaempferia, Hedychium, and Curcuma).

Fig. 9
figure 9

Cross-species transferability of 14 polymorphic EST-SSR markers in different Zingiberaceae species. The transferability rates of EST-SSR markers in 11 Zingiberaceae species (A). UPGMA phylogenetic analysis of 12 Zingiberaceae species based on transferable SSR markers (B)

Discussion

Characterization of the Amomum tsaoko transcriptome

For species without reference genomes, transcriptome sequencing is considered the most effective way of mining functional genes and developing novel molecular markers [38, 39]. The results of this study represent the first report of transcriptome analysis and EST-SSR detection in A. tsaoko. To maximize the transcriptome information gained, mixed samples of five organs, namely roots, stem, leaves, flowers, and fruit, were collected from A. tsaoko for paired-end transcriptome sequencing. A total of 58,278,226 clean and high-quality reads with a 94.25% Q30 level, which ensures the quality of sequencing. In the present study, the N50 sizes of generated unigenes (2002 bp) were obviously longer than those in the Zingiberaceae family, such as A. villosum (N50 = 1381 bp) [15], C. wenyujin (N50 = 1566 bp) [40], Curcuma longa (N50 = 1515 bp) [41], Curcuma alismatifolia (N50 = 1501 bp) [42], Zingiber officinale (N50 = 1077 bp) [43], and Elettaria cardamomum (N50 = 616–664 bp) [44].

Among them, 128,174 unigenes were annotated to 7 major databases, such as GO, KEGG, NR, and KOG, accounting for 87.24% of the total unigenes. These annotated unigenes provide a reference basis for the study of metabolic pathways, gene function classification, plant hormone signal transduction, and quality character analysis of A. tsaoko. In addition, there are 18,737 unigenes unannotated. It is speculated that these unigenes may be specific new genes of A. tsaoko or that the non-coding RNA sequence and public gene database are imperfect [45,46,47]. Based on NR alignment results, most of the unigenes were annotated to Musa acuminata (80.4% similarity), similar to the report in Curcuma alismatifolia (Zingiberaceae family) (80.4% similarity) [42]. Zingiberaceae (A. tsaoko) and Musaceae (Musa acuminata) belong to Zingiberacea, and both are tropical plants, which may explain the large number of homologous genes between them [48]. In addition, 8.3% of the homologous sequences have not been matched, which may because some unigene fragments are too small to match a single data sequence. A. tsaoko can be divided into 3 major categories and 55 subcategories in GO functional classification, mainly focusing on cellular part, catalytic activity, and metabolic process. The same outcome was found in the transcriptomic study of A. villosum from congeneric species [15]. Volatile terpenoids are the active metabolites in the essential oil of A. tsaoko, but the terpene synthases (TPS) responsible for their biosynthesis remain unknown. In our study, the KEGG enrichment analysis found that 496 unigenes were enriched in the metabolic pathways of terpenoids, which will provide a better understanding of terpenoid biosynthesis in A. tsaoko. The DDXXD conserved motifs contained in the amino acid sequence of Cluster-24,134.3 binds catalytic substrates (such as GPP) through complexation with metal ions (such as Mg2+), which is shared by ionization-dependent terpenoid synthases belonging to the monoterpene synthase functional domain [49, 50]. The phylogenetic tree based on amino acid sequences showed that Cluster-24,134.3 had the highest homology with the AvTPS2 identified in A. villosum, which is in the same genus, and was quite different from the 1,8-cineole synthase gene of Arabidopsis and Salvia officinalis. This may be because the phylogeny and lineage differentiation of angiosperm TPS are closely related to the natural plant classification system, and the similarity of TPS from similar species is much higher than that of TPS from different sources with the same function [51]. Follow-up cloning and functional verification of the corresponding genes of Cluster-24,134.3 are necessary.

Frequency and distribution of EST-SSRs

Molecular marker technology has been widely used in various fields of plant sciences, including germplasm resource identification, genetic diversity, new variety breeding, and map construction [18]. Among them, SSRs are known as a marker type widely used at present. Furthermore, SSRs can be divided into two categories: genomic and expressed sequence tags (EST-SSRs). The construction of a transcriptome platform has promoted the development of DNA molecular markers to a great extent. To date, there are only a few reports about molecular markers in A. tsaoko [1, 17, 35]. Recently, 123 SSRs were identified in the A. tsaoko chloroplast genome; the mean density of SSR loci is up to 1/1.33 kb; however, they have not been validated for A. tsaoko [48]. In recent years, the use of transcriptome data to obtain sequences containing microsatellites, and the study of their genetic diversity has been successfully reported [52,53,54]. Excluding mononucleotide repeats, trinucleotides were found to be the most abundant repeats in the present study. This result was consistent with previous studies in other plants, such as Curcuma alismatifolia [42], Vicia amoena [22], and Pseudotaxus chienii [55]. Importantly, the abundance of trinucleotides in coding regions does not change the coding frame and therefore may not affect the functions of the genes [56, 57]. The most abundant dinucleotide SSR motifs were AG/CT (59.41%), a similar finding has been reported in other plant genomes [42, 53, 58, 59]. This may be because alanine (Ala) and leucine (Leu) are the amino acids with the highest frequency in proteins, while the AG/CT dinucleotide motif is present in the codons of Ala and Leu [60]. Moreover, some studies have shown that the CT motif frequently occurs in the 5′ UTRs; this motif plays an important role in regulating nucleic acid metabolism and gene expression in plants [61, 62]. Among trinucleotide repeat motifs, AGG/CCT (24.48%) occurred most commonly, agreeing with the results reported in Zantedeschia rehmannii [63], Phragmites karka [64], Amorphophallus [65], and Aspidistra saxicola [66]. It has been suggested that AGG/CCT interruptions play a protective role in genomic packaging [67, 68].

Validation of EST-SSR markers

In our study, 64 primer pairs (80%) were successfully amplified among the 80 randomly selected EST-SSRs (designated AM201-AM280) across 4 A. tsaoko accessions. Some studies have pointed out that, compared with genomic SSRs, EST-SSRs have a better amplification effect, but the polymorphism rate is relatively low [69, 70]. However, these newly developed EST-SSRs (Na = 5.444; Ho = 0.594; He = 0.613) have higher resolutions than microsatellite markers (Na = 2.000; Ho = 0.250–0.372; He = 0.432–0.456) [35] and genomic SSR markers (Na = 3.913; Ho = 0.468; He = 0.500) [1] in A. tsaoko. This phenomenon has also been reported in poplar [71]. PIC is one of the main indices used to evaluate locus polymorphisms, including high (PIC > 0.5), moderate (0.5 > PIC > 0.25), and low (PIC < 0.25) polymorphism [72]. Among the 72 accessions, the PIC values indicated a good informative level of these EST-SSRs, including 13 highly polymorphic (AM203, AM206, AM223, AM224, AM225, AM237, AM242, AM247, AM248, AM255, AM272, AM273, and AM278), four moderately polymorphic (AM207, AM213, AM218, and AM279), and only one low polymorphic (AM208) markers. EST-SSR markers exhibit much higher polymorphism than RAPD, ISSR, SRAP, and genomic SSR markers for A. tsaoko [1, 17, 34]. HWE analysis showed that 12 primer pairs (AM203, AM206, AM207, AM208, AM213, AM218, AM223, AM242, AM255, AM272, AM273, and AM279) were in accordance with HWE, while the remaining primer sets (AM224, AM225, AM237, AM247, AM248, and AM278) significantly deviated from HWE (P < 0.05). Taken together, the newly developed EST-SSRs are suitable for exploring the genetic diversity and relationships of A. tsaoko. These molecular markers will be effective tools in fingerprinting germplasm collections from different sources to guide germplasm evaluation and conservation in A. tsaoko. We also analyzed the characteristics of SSR loci within the terpenoid metabolism-related unigenes (Table S1). Next, we will develop molecular markers around these SSR loci, which is very important for gene mining and marker-assisted selection breeding for controlling important traits in A. tsaoko.

Genetic diversity of the A. tsaoko population

Genetic diversity is an important indicator of the effective management and utilization of plant germplasm resources. The level of genetic diversity is often correlated with its range of distribution; the wide distribution range of plants usually shows higher levels of genetic diversity compared with narrowly distributed plants [73, 74]. However, while the known distribution of A. tsaoko is narrow, it retains relatively high levels of genetic diversity at the population level (PPB = 98.150%, I = 1.019, Ho = 0.594, and He = 0.560). Similar results have been reported for many other plant species, such as Petunia secreta [75], Nouelia insignis [76], and Calanthe tsoongiana [77]. In our study, Jingping and Lvchun counties, as possible places of origin of A. tsaoko, harbored the highest number of private alleles (PAr = 0.167), which indicates the presence of specific sequences or genes in JP and LVC populations [78]; this was also consistent with our previous findings [17]. The overall mean value of the fixation index was − 0.066, showing an excess of heterozygosity present in the A. tsaoko population; this result is in agreement with a previous study carried out with genomic SSR markers [1]. The pairwise FST analysis (average pairwise FST = 0.052) detected no significant genetic differentiation among six A. tsaoko populations. Correspondingly, the AMOVA results also showed that much more genetic variance exists among individuals within populations. High genetic diversity, excess heterozygotes, and low genetic differentiation are characteristics of outbreeding populations [79]. A. tsaoko is a perennial cross-pollinating plant pollinated by insects (native bumblebee), allowing allele exchange among individuals in the population [17]. In this study, the accessions did not cluster according to different sampling sites by UPGMA cluster, and the PCoA analysis further confirmed low genetic differentiation among populations.

Compared with bulk crops, the molecular markers available for Zingiberaceae are very limited. In our study, the ESR-SSR markers developed in A. tsaoko have high cross-species transferability (with an average transferability of 68.18%) with other Zingiberaceae species, and these markers can accurately classify different species, making them useful for increasing the number of available molecular markers of Zingiberaceae species.

Conclusions

In this study, we reported the first transcriptome sequencing analysis of A. tsaoko using Illumina sequencing technology. The de novo assembly generated a total of 146,911 unigenes with an average length of 1527 bp and an N50 of 2002 bp. In total, 128,174 unigenes were successfully annotated to the NR (84.01%), NT (65.20%), KEGG (36.11%), Swiss-prot (65.84%), PFAM (60.97%), GO (61.09%), and KOG (24.92%) databases. 496 unigenes involved in the terpenoid biosynthesis pathway were identified. Based on the transcriptome assembly, 55,590 potential EST-SSRs were identified and characterized. Seventy-two A. tsaoko accessions using 18 novel polymorphic EST-SSR primers showed rich genetic diversity and low genetic differentiation among populations. The transcriptome data from this study will provide valuable resources for investigating gene functions, and the EST-SSR markers developed will provide a foundation for germplasm identification, genetic diversity assessment, and marker-assisted breeding in A. tsaoko.

Methods

Plant material

The transcriptome sequencing materials were collected from the young roots, stems, leaves, flowers, and fruits of 5-year-old A. tsaoko plants in Caoguoshan Village, Adebo Township, Jinping County, Honghe Prefecture, Yunnan Province, China (22°54′30.34″N, 103°13′16.39″E). To verify EST-SSR markers, young leaves from a total of 72 A. tsaoko accessions were collected from different sites covering the plant’s major distribution areas in China (Fig. 10; Table S2). For cross-species transferability analysis, 12 Zingiberaceae species were used, which consisted of two species in the genus Amomum (A. tsaoko and A. villosum), two species in Kaempferia (K. galangal and K. rotunda), two species in Hedychium (H. flavum and H. coronarium), and three species in Curcuma (C. kwangsiensis, C. caesia, and C. phaeocaulis) (Table S3). No approval or permission was required to collect these samples. All the species were identified by Professor Bingyue Lu (Honghe University). The genomic DNA was extracted from leaf tissues using a modified cetyltrimethylammonium bromide (CTAB) method [80].

Fig. 10
figure 10

Geographic locations of the 6 populations of A. tsaoko included in this study

RNA isolation, library preparation, and transcriptome sequencing

Total RNA was extracted from three biological replicates (with two technical replicates per each biological replicate) from mixed samples of the five different organs using the RNeasy extraction kit (QIAGEN, Beijing, China). The integrity and purity of extracted RNA were measured by a NanoDrop spectrophotometer and agarose gel electrophoresis. After total RNA extraction, eukaryotic mRNA was enriched with Oligo (dT) beads. Next, fragmentation buffer was added to break mRNA into short fragments. Then, these mRNA fragments were used to synthesize first-strand cDNA with hexameric random primers. Subsequently, second-strand cDNA was synthesized by adding DNA polymerase I, RNase H, dNTPs, and buffer. The purified double-stranded cDNA was repaired at the end, dA-tailed fragments were added, and it was connected to the sequencing connector. After library inspection, the quality-qualified libraries were sequenced using the Illumina HiSeq platform (Novogene Biotech Co., Ltd., Beijing, China).

De novo assembly and annotation

After sequencing, raw data were filtered, and high-quality clean data were obtained by removing the joint sequence and low-quality reads. High-quality reads were used for de novo assembly using Trinity software [81]. Finally, we obtained sequences that could not be extended on either end, which were considered unigenes. The resulting clean data were deposited in the Sequence Read Archive of National Center for Biotechnology Information (NCBI) (Bioproject no. PRJNA735890; Biosample no. SAMN19601270). The assembled A. tsaoko transcripts were compared with the NCBI non-redundant protein sequences (NR), NCBI non-redundant nucleotide sequences (NT), protein family (PFAM), eukaryotic clusters of orthologous groups (KOG), a manually annotated and reviewed protein sequence (SwissProt), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases using the BLASTX algorithm (significant thresholds of E-value < 10− 5), and Gene Ontology (GO) annotations were obtained using Blast2GO software (http://www.geneontology.org) based on NR annotations.

Development and characterization of EST-SSRs

Based on the transcriptome sequencing data of A. tsaoko, MISA software (https://webblast.ipk-gatersleben.de/misa/) was used to search the SSR loci of A. tsaoko. The search criteria were as follows: mono- to hexa-nucleotide repeats ≥10, 6, 5, 5, 5, and 5; mono- to hexa-nucleotide repeat types are labeled P1, P2, P3, P4, P5, and P6. Primer3 (http://primer3.sourceforge.net/releases.php) was used to design SSR primers based on the detected sites.

SSR analysis

Eighty pairs of EST-SSR primers were randomly synthesized, and four A. tsaoko germplasm resources with different fruit shapes (round type, oval type, long type, and shuttle type) were selected to verify the effectiveness and polymorphisms of the primers. The primers producing clear bands and obvious polymorphisms were subsequently used for genetic diversity assessments. The PCR reaction system, amplification program, and gel electrophoresis were as previously described [1].

Data analysis

According to the polyacrylamide electrophoresis results, at the same migration location, a band is marked as “1,” no band is marked as “0,” and a missing band is indicated by “-”. Subsequently, 0/1 data were preprocessed using DataFormater software [82], which transformed SSR data into readable input files for NTSYS-pc and PowerMarker. The GenAlEx 6.501 program [83] was employed to calculate genetic diversity parameters, namely the observed number of alleles (Na), the effective number of alleles (Ne), Shannon’s information index (I), expected heterozygosity (He), observed heterozygosity (Ho), number of private alleles (PAr), and fixation index (F), which was used to assess Hardy–Weinberg genetic equilibrium. PowerMarker v3.25 [84] was used to estimate the polymorphism information content (PIC) of each EST-SSR primer pair. We estimated the frequencies of null alleles (FNA) using Micro-checker 2.2.3 [85]. The genetic differentiation between populations was carried out using analysis of molecular variance (AMOVA), pairwise FST, and pairwise Nei’s genetic distance in GenAlEx software. Cluster analysis (unweighted pair group method with arithmetic mean algorithm) and principal coordinate analysis (PCoA) were performed using NTSYS-pc software [86] and GenAlEx v6.5, respectively. Multiple amino acid sequence alignment and phylogenetic analysis was performed using DNAMAN 8.0 and MEGA version 5.2 software. The amino acid sequences of the 1,8-cineole synthase gene of Arabidopsis thaliala (GenBank accession no. NM_113483.5) and Salvia officinalis (GenBank accession no. AF051899.1) were downloaded from the NCBI (http://www.ncbi.nlm.nih.gov/). The amino acid sequences of four monoterpene synthase genes (AvTPS1, AvTPS2, AvTPS3, and AvTPS4) of A. villosum were obtained from the published literature [87].