RNA-sequencing and de novo transcriptome assembly
A total of 14.3 Gbp of reads from the nine tissues of G. sinensis were checked for quality and then used for transcriptome analysis. About 95% of the raw reads were of good quality (quality value (QV) > 30). The high-quality cleaned reads were used for de novo transcriptome assembly with Trinity. We obtained 230,780 contigs and estimated their expression levels with Bowtie2 in RSEM. After removing those with low expression (FPKM < 1), 81,511 contigs were retained as unique transcripts (unitranscripts). Because the unitranscripts still included splicing variants, the longest isoform of each gene was selected as a unigene. As a result, we obtained 47,855 unigenes whose average, N50, and maximum lengths were 1103 bases, 1952 bases, and 17,250 bases, respectively, and whose GC content was 40.7% (Table 1).
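The filtering and summary steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the input records, the function names, and the toy values are hypothetical, and it assumes Trinity-style transcript identifiers in which the gene identifier is the part before the final "_i" isoform suffix. It also assumes that the FPKM threshold is applied per transcript before the longest isoform is chosen.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def select_unigenes(transcripts, min_fpkm=1.0):
    """Keep transcripts with FPKM >= min_fpkm, then keep the longest
    isoform per gene.

    transcripts: iterable of (transcript_id, length, fpkm) tuples
    (hypothetical input format).
    Returns a dict mapping gene_id -> (transcript_id, length).
    """
    best = {}
    for tid, length, fpkm in transcripts:
        if fpkm < min_fpkm:
            continue  # low-expression transcript is discarded
        gene = tid.rsplit("_i", 1)[0]  # assumed Trinity naming convention
        if gene not in best or length > best[gene][1]:
            best[gene] = (tid, length)
    return best


# Toy data, invented for illustration only.
toy = [
    ("TRINITY_DN1_c0_g1_i1", 900, 5.2),
    ("TRINITY_DN1_c0_g1_i2", 1500, 3.1),
    ("TRINITY_DN2_c0_g1_i1", 400, 0.4),   # removed: FPKM < 1
    ("TRINITY_DN3_c0_g1_i1", 700, 12.0),
]
unigenes = select_unigenes(toy)
unigene_lengths = [length for _, length in unigenes.values()]
```

With the toy records, two unigenes remain (the 1500-base isoform of the first gene and the single 700-base transcript of the third), and their N50 is 1500.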
The Blast2GO searches successfully annotated the functions of 31,717 of the 47,855 unigenes (66.3%). The species distribution of the matched entries is shown in Fig. 2a; 80% of the annotated unigenes were highly similar to genes of Fabaceae family members (Cajanus cajan, Glycine max, Lupinus angustifolius, and Glycine soja). The top 20 annotated gene ontology (GO) terms at annotation level 3 in the three categories (biological process, molecular function, and cellular component) are shown in Fig. 2b. The top five GO biological process terms were all metabolic processes: organic substance metabolic process (GO: 0071704), cellular metabolic process (GO: 0044237), primary metabolic process (GO: 0044238), nitrogen compound metabolic process (GO: 0006807), and biosynthetic process (GO: 0009058). Organic cyclic compound binding (GO: 0097159), heterocyclic compound binding (GO: 1901363), ion binding (GO: 0043167), transferase activity (GO: 0016740), and small molecule binding (GO: 0036094) were the top five GO molecular function terms. Intracellular (GO: 0005622), intracellular part (GO: 0044424), intracellular organelle (GO: 0043229), intrinsic component of membrane (GO: 0031224), and membrane-bounded organelle (GO: 0043227) were the top five GO cellular component terms.
To estimate gene expression levels in the nine tissues, all the cleaned reads were mapped to the contigs constructed by de novo transcriptome assembly using Bowtie2 in RSEM. The raw counts were normalized between tissues using the trimmed mean of M-values (TMM) method and transformed to counts per million (CPM) using edgeR. To compare the gene expression patterns among the nine tissues, we performed principal component analysis (PCA) using the prcomp function in R. PCA revealed that the nine tissues clustered according to organ (Fig. 3). The first two principal components accounted for 41% of the gene expression variance. Along the PC2 axis, gene expression in the leaf and young leaf (yellow and gray) was clearly separated from that in the other tissues. Along the PC1 axis, the other tissues were clustered into three groups: branch, wood, and bark (blue, pink, and red); fruit and stalk (orange and brown); and bud and flower (green and purple). These results showed that the gene expression dynamics differed among the nine tissues.
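As a rough illustration of the CPM transformation, the Python sketch below scales raw counts by an effective library size, defined as the library size multiplied by a per-sample normalization factor. The function name and the toy counts are hypothetical. The TMM factors themselves are not computed here: they are passed in as an argument and default to 1, so the sketch does not reproduce the edgeR calculation used in this study, and the PCA step is omitted.

```python
def counts_per_million(counts, norm_factors=None):
    """Convert a gene-by-sample matrix of raw counts to CPM.

    counts: list of rows (one per gene), each a list with one raw
            count per sample.
    norm_factors: optional per-sample scaling factors (for example,
            TMM factors computed elsewhere); defaults to 1.0 for
            every sample, which gives plain library-size CPM.
    """
    n_samples = len(counts[0])
    if norm_factors is None:
        norm_factors = [1.0] * n_samples
    # Library size = total raw count in each sample (column sum).
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    # Effective library size = library size x normalization factor.
    effective = [lib_sizes[j] * norm_factors[j] for j in range(n_samples)]
    return [
        [row[j] * 1e6 / effective[j] for j in range(n_samples)]
        for row in counts
    ]


# Toy matrix: three genes (rows) in two samples (columns), invented values.
toy_counts = [
    [10, 20],
    [30, 60],
    [60, 120],
]
toy_cpm = counts_per_million(toy_counts)
```

In the toy matrix the second sample is sequenced exactly twice as deeply as the first, so every gene receives the same CPM in both samples.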
Next, we analyzed the tissue specificity of the highly expressed genes in each of the nine tissues based on the tissue specificity index (tau). The tau value ranges from 0 to 1, where 1 means that a gene is highly expressed in only one tissue. We identified 3959 genes (8.2%) with tau > 0.95 as tissue-specific genes. A heatmap displayed the differential expression patterns of the tissue-specific genes in the nine tissues: 77 genes in young leaf, 228 genes in leaf, 1335 genes in branch, 180 genes in wood, 381 genes in bark, 955 genes in bud, 421 genes in flower, 41 genes in stalk, and 341 genes in fruit (Fig. 4). Many branch-specific genes were also expressed in bark, and many bud-specific genes were also expressed in flower. On the other hand, the fruit-specific genes were not highly expressed in the other tissues. To clarify the biological functions associated with the different tissue-specific genes, we performed GO enrichment analysis. The top 10 GO terms in fruit-specific genes are shown in Figs. 5a–c. Figure 5a shows that five ontology terms were significantly enriched in biological process (BP). The top three enriched terms were flavonoid metabolic process (GO: 0009812, p value = 3.58e−10), flavonoid glucuronidation (GO: 0052696, p value = 2.02e−5), and flavonoid biosynthetic process (GO: 0009813, p value = 5.37e−5). For the molecular function (MF) terms, Fig. 5b shows that 12 GO terms were significantly enriched with p values < 0.01; the top three enriched terms were UDP-glycosyltransferase activity (GO: 0035251, p value = 4.26e−9), lipase activity (GO: 0016298, p value = 5.42e−6), and hydrolase activity acting on ester bonds (GO: 0016788, p value = 4.34e−5). Figure 5c shows that intracellular membrane-bounded organelle (GO: 0043231, p value = 8.84e−4) was the most significantly enriched term in cellular component (CC). All significant GO terms in the other tissues are presented in Supplementary Table S1.
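The tissue specificity index can be sketched in a few lines of Python. The sketch assumes the commonly used definition tau = sum(1 - x_i / x_max) / (n - 1), where x_i is the expression of a gene in tissue i and n is the number of tissues. Whether the expression values were log-transformed beforehand is not stated here, so the input scale is an assumption, and the function names and example values are hypothetical.

```python
TISSUES = ["young leaf", "leaf", "branch", "wood", "bark",
           "bud", "flower", "stalk", "fruit"]


def tau(expression):
    """Tissue specificity index of one gene.

    expression: list of non-negative expression values, one per tissue.
    Returns 0.0 for uniform expression and 1.0 when the gene is
    expressed in exactly one tissue.
    """
    n = len(expression)
    x_max = max(expression)
    if n < 2 or x_max <= 0:
        return 0.0  # undefined cases are treated as non-specific
    return sum(1 - x / x_max for x in expression) / (n - 1)


def specific_tissue(expression, tissues=TISSUES, threshold=0.95):
    """Return the tissue with the highest expression if tau exceeds
    the threshold, otherwise None."""
    if tau(expression) <= threshold:
        return None
    top = max(range(len(expression)), key=lambda i: expression[i])
    return tissues[top]


# Invented example genes.
fruit_only = [0.0] * 8 + [120.0]   # expressed only in fruit
broad = [50.0] * 9                 # same level in all nine tissues

print(tau(fruit_only), specific_tissue(fruit_only))   # 1.0 fruit
print(tau(broad), specific_tissue(broad))             # 0.0 None
```

A gene expressed in a single tissue reaches tau = 1, whereas uniform expression gives tau = 0, so the 0.95 cutoff retains only genes whose expression is almost entirely confined to one tissue.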
We analyzed the saponin content of the nine tissues using LC–MS/MS. The base peaks of the metabolites extracted from fruit are shown in Fig. 6a. Gleditsioside I was the most abundant saponin detected in the fruit of G. sinensis. Gleditsioside I was previously isolated from fruit, which is consistent with our result. The profiles in each tissue of the 10 saponins with the highest contents in fruit are shown in Fig. 6b. Almost all of these saponins were detected at higher levels in fruit, bud, and flower.
Identification of genes involved in saponin biosynthesis
The mevalonate (MVA) and methylerythritol phosphate (MEP) pathways are essential biosynthetic processes for formation of the triterpenoid backbone (Fig. 7). The heatmap represents the expression profiles of putative genes associated with the MVA and MEP pathways (Fig. 7). In the MVA pathway, we annotated three acetyl-CoA acetyltransferases (AACT, EC: 2.3.1.9), three HMG-CoA synthases (HMGS, EC: 2.3.3.10), four HMG-CoA reductases (HMGR, EC: 1.1.1.34), three mevalonate kinases (MVK, EC: 2.7.1.36), one phosphomevalonate kinase (PMK, EC: 2.7.4.2), and three mevalonate-5-diphosphate decarboxylases (MVD, EC: 4.1.1.33). In the MEP pathway, in which MEP is converted to isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP), we annotated ten 1-deoxy-d-xylulose 5-phosphate synthases (DXS, EC: 2.2.1.7), three 1-deoxy-d-xylulose 5-phosphate reductases (DXR, EC: 1.1.1.267), one 2-C-methyl-d-erythritol 4-phosphate cytidylyltransferase (MCT, EC: 2.7.7.60), one 4-(cytidine 5′-diphospho)-2-C-methyl-d-erythritol kinase (CMK, EC: 2.7.1.148), two 2-C-methyl-d-erythritol 2,4-cyclodiphosphate synthases (MDS, EC: 4.6.1.12), one 4-hydroxy-3-methylbut-2-enyl diphosphate synthase (HDS, EC: 1.17.7.1), and seven 4-hydroxy-3-methylbut-2-enyl diphosphate reductases (HDR, EC: 1.17.1.2). The structural diversity of triterpenoids is increased by modification of the backbone through hydroxylation by cytochrome P450 monooxygenases (P450s) and glycosylation by UDP-glycosyltransferases (UGTs). The P450s form a large family in the plant genome. In recent studies, several P450s were identified as enzymes involved in saponin biosynthesis. The CYP93E subfamily catalyzes the C24-hydroxylation of beta-amyrin in Glycine max, the CYP88D subfamily catalyzes the two-step oxidation of beta-amyrin at C11, and the CYP72A subfamily catalyzes the hydroxylation of C22 in 24-hydroxy-beta-amyrin.
From our transcriptome data, 136 P450s and 77 UGTs were annotated using HMMER against the Pfam database (Supplementary Table S2). Among these candidates, 26 P450s and 10 UGTs were highly similar to known P450 family members (CYP51, CYP71, CYP716, CYP72, CYP88, and CYP93) and UGT family members (UGT71, UGT74, UGT91, and UGT94) involved in saponin biosynthesis. Figure 8a, b show the gene expression patterns of the candidate P450s and UGTs. Our metabolomics analysis and the GO enrichment analysis of tissue-specific genes suggested that fruit is one of the main tissues of saponin biosynthesis. We therefore screened the unigenes annotated as P450s and UGTs for high expression in fruit. Seven P450s (Fig. 8a) and one UGT (Fig. 8b) that were highly expressed in the fruit of G. sinensis were identified as candidate genes involved in the biosynthesis of triterpenoid saponins.