Background

Gastrodia elata (G. elata) is a typical heterotrophic plant used in traditional Chinese medicine and has been widely used in the clinic. It belongs to the genus Gastrodia R. Br. of the family Orchidaceae and has more than 20 synonyms. G. elata is mainly distributed in Asia, including China, Japan, Korea, and India [1]. It is an unusual medicinal plant: its seeds have no endosperm, and its roots and leaves are highly degenerated. It can neither absorb nutrients directly from the soil nor synthesize the substances it requires through photosynthesis. The growth and development cycle of G. elata comprises the seed, protocorm, juvenile tuber, immature tuber, mature tuber, scape and flower stages. About 80% of this cycle takes place underground in association with two fungi, Armillaria mellea (A. mellea) and Mycena [2, 3]. Mycena provides nutrition for the seed germination of G. elata, and A. mellea provides nutrition and energy for the development of vegetative propagation corms into tubers [3, 4].

G. elata has many pharmacological effects, including antihypertensive [5], antioxidant [6], antiaging [7], antitumor [8] and immunomodulatory [9] effects. Several ingredients have been identified in G. elata, including gastrodin, vanillin, vanillyl alcohol, p-hydroxybenzyl alcohol, glycoproteins, flavonoids and polysaccharides [10]. Gastrodin is one of the active components in the root of G. elata and has been shown to protect neurons against hypoxic injury [11]. Polysaccharide extracts from G. elata can also attenuate vincristine-evoked neuropathic pain [12]. In addition, G. elata is used as a medicine-food homologous material, especially in northwest China [13]. The dried tuber of G. elata has been used for centuries in traditional Chinese medicine, in which it is considered to dispel wind, calm the hyperactive liver and dredge the collaterals [14]. Chinese patent medicines containing G. elata are also widely used in the clinic with positive effects. For example, Tianma Gouteng drink, a traditional Chinese medicine prescription, has been used clinically to treat cerebral infarction [15]. Banxia Baizhu Tianma decoction is another representative prescription, which invigorates the spleen and eliminates phlegm [16]. All of these pharmacological effects and functions depend on the active components of G. elata. G. elata is therefore a valuable medicinal plant, and it is necessary to identify and analyze the key genes that regulate the accumulation of active components in order to improve its medicinal value and meet future demand.

With the development of high-throughput technology, massive amounts of G. elata data have accumulated. Since 2018, four genome assemblies of G. elata have been released. The first G. elata genome was sequenced and annotated by Yuan et al. in 2018 [2]. Based on that genome, we constructed a basic platform for gene function analysis of G. elata (GelFAP) [17]. An improved version of the G. elata genome was completed by Chen et al. in 2020 [18]. Recently, a high-quality chromosome-level genome sequence of G. elata in China was decoded by Xu et al. [19], and Bae et al. also reported a chromosome-level genome of G. elata [20]. The improvement and availability of these genomes provide an invaluable resource for investigating the biosynthesis of the active components of G. elata. Here, we constructed a new version of the gene function analysis platform for G. elata based on the chromosome-level genome published by Xu et al., which will provide a reference for users studying gene function and the synthesis pathways of active components.

Materials and methods

Data resource and functional annotation

Genome data of G. elata were obtained from the National Genomics Data Center (NGDC) (accession number: GWHBDNU00000000). Of the transcriptome samples used in this study, 45 were downloaded from the Sequence Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra) and 6 were produced by our group (Table S1). GO annotation was collected from the Gene Ontology Consortium [21], and KEGG annotation was predicted with GhostKOALA [22]. Sequences of Ethylene-responsive element binding factor-associated Amphiphilic Repression (EAR) motif-containing proteins and CAZy (Carbohydrate-Active enZyme) proteins were obtained from PlantEAR [23] and the CAZy database [24], respectively.

Co-expression network construction

First, we mapped the transcriptome data to the reference genome with HISAT2 [25] and calculated the TPM (Transcripts Per Million) value of each gene in each sample with StringTie [26]. Second, the Pearson correlation coefficient (PCC) between each pair of genes was calculated with an in-house Perl script. We then defined the co-expression network according to the scale-free model fit index (R2) and the number of nodes. When no PCC cutoff gave an R2 above 0.9, we chose the cutoff with the best R2. When several cutoffs gave an R2 above 0.9, we chose the one that retained the largest number of nodes. Integrating the co-expression network with expression profiles enables effective analysis of gene function. We therefore performed differential expression analysis on the G. elata transcriptome samples and integrated the results into the display of the gene co-expression network.
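
The cutoff selection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions and not the in-house Perl script: the function names, the input layout (a dict mapping gene IDs to TPM vectors) and the unbinned log-log fit used to estimate R2 are all choices made for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant vector: treated as uncorrelated (assumption)
    return cov / math.sqrt(vx * vy)


def scale_free_r2(degrees):
    """R^2 of a straight-line fit of log10 p(k) against log10 k."""
    counts = {}
    for k in degrees:
        counts[k] = counts.get(k, 0) + 1
    if len(counts) < 3:
        return 0.0  # too few distinct degrees for a meaningful fit
    total = len(degrees)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / total) for c in counts.values()]
    return pearson(xs, ys) ** 2  # R^2 of simple linear regression equals r^2


def network_stats(tpm, cutoff):
    """Return (number of nodes, R^2) of the positive network at PCC > cutoff."""
    degree = {}
    for g1, g2 in combinations(tpm, 2):
        if pearson(tpm[g1], tpm[g2]) > cutoff:
            degree[g1] = degree.get(g1, 0) + 1
            degree[g2] = degree.get(g2, 0) + 1
    return len(degree), scale_free_r2(list(degree.values()))


def choose_cutoff(stats, r2_min=0.9):
    """stats maps cutoff -> (nodes, r2); apply the selection rule from the text."""
    passing = {c: s for c, s in stats.items() if s[1] > r2_min}
    if passing:
        # Several cutoffs fit well: keep the one covering the most nodes.
        return max(passing, key=lambda c: passing[c][0])
    # No cutoff reaches r2_min: fall back to the best-fitting one.
    return max(stats, key=lambda c: stats[c][1])
```

A negative network can be screened the same way by testing `pearson(...) < -cutoff` instead. The all-pairs loop is quadratic in the number of genes, so a genome-scale run would need a vectorized implementation.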

Protein-protein interaction (PPI) network construction

As in our previous study, the rice and maize PPI networks were collected from the public databases RicePPINet [27] and PPIM [28], respectively. To construct the G. elata PPI network, we predicted orthologous relationships between rice and G. elata, and between maize and G. elata, using InParanoid [29] with a bootstrap cutoff above 60%. We then mapped the rice and maize PPI networks onto G. elata through these orthologs.
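
The ortholog-based transfer of interactions can be sketched as follows. This is a minimal Python illustration, assuming the InParanoid results have already been filtered and loaded into a dict; the function name and all gene IDs are hypothetical.

```python
def map_ppi(ppi_pairs, orthologs):
    """Transfer PPI pairs from a reference species to G. elata.

    ppi_pairs: iterable of (ref_gene_a, ref_gene_b) tuples
    orthologs: dict mapping a reference gene ID to a set of G. elata gene IDs
    Returns a set of unordered G. elata pairs, so duplicates collapse.
    """
    mapped = set()
    for a, b in ppi_pairs:
        for ga in orthologs.get(a, ()):
            for gb in orthologs.get(b, ()):
                # Sort each pair so (x, y) and (y, x) count as one interaction.
                mapped.add(tuple(sorted((ga, gb))))
    return mapped


# Hypothetical toy inputs for rice and maize.
rice_ppi = [("Os1", "Os2"), ("Os2", "Os1")]
rice_orth = {"Os1": {"Gel1"}, "Os2": {"Gel2"}}
maize_ppi = [("Zm1", "Zm2")]
maize_orth = {"Zm1": {"Gel2"}, "Zm2": {"Gel1", "Gel3"}}

# The union of the two mapped networks removes pairs supported by both species.
gel_ppi = map_ppi(rice_ppi, rice_orth) | map_ppi(maize_ppi, maize_orth)
```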

Gene family identification

First, we used InParanoid [29] to predict orthologous relationships between Arabidopsis and G. elata proteins, and identified CAZy and EAR motif-containing proteins based on these relationships. Using iTAK (Plant Transcription Factor & Protein Kinase Identifier and Classifier) software (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi) [30], we identified and classified the transcription factors and protein kinases in G. elata. Ubiquitin families in G. elata were identified with hidden Markov models obtained from iUUCD v2.0 (an integrated database of regulators for ubiquitin and ubiquitin-like conjugation, http://iuucd.biocuckoo.org/) [31]. KEGG pathways were annotated for the whole genome with GhostKOALA [22], and CYP450 genes were functionally annotated on the basis of the KEGG annotations.

Construction of GelFAP v2.0

The platform was built on the LAMP (Linux, Apache, MySQL, PHP) technology stack. A MySQL database was created by importing all relevant data and analysis results, including gene structure annotation, gene functional annotation, the co-expression network, the PPI network and gene family classification. HTML, PHP, JavaScript and CSS were used to build dynamic web pages for data presentation and analysis.

Toolkit for gene function analysis

We introduced gene set enrichment analysis (GSEA) [32] and a cis-element enrichment analysis tool as described previously [33, 34]. ViroBlast [35] was used to build the Blast analysis tool. JBrowse, developed by Buels et al. [36] for displaying omics information, was also introduced into the platform. In addition, we developed a sequence extraction tool with a Perl script and introduced a heatmap analysis tool based on the Highcharts JavaScript library.
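
The idea behind a sequence extraction tool can be sketched as follows. The platform's tool is a Perl script whose details are not given here, so this Python version is only an assumption-laden illustration: the function names, the 1-based inclusive coordinates and the strand handling are choices made for the example.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict: sequence ID -> sequence string."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract(seqs, seq_id, start=None, end=None, strand="+"):
    """Return a whole record, or the 1-based inclusive region start..end."""
    seq = seqs[seq_id]
    if start is not None or end is not None:
        s = 1 if start is None else start
        e = len(seq) if end is None else end
        if not 1 <= s <= e <= len(seq):
            raise ValueError("region outside sequence bounds")
        seq = seq[s - 1:e]
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq


# Hypothetical toy record.
genome = read_fasta([">chr1 toy record", "ACGTAC", "GT"])
region = extract(genome, "chr1", 2, 5)
```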

Results

Gene structure and functional annotation

First, we collected the genome information of G. elata from the NGDC database, comprising 19,493 genes, 33,561 transcripts and 33,561 proteins. By aligning the protein sequences against the NR, TAIR, UniProt and Swiss-Prot databases, we annotated 17,121, 14,640, 17,085 and 13,070 genes, respectively. We also assigned GO annotations to 12,720 genes with InterProScan [37]. KEGG descriptions were assigned to 3988 genes using the GhostKOALA online tool [22] of the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (https://www.kegg.jp/) [38,39,40]. Protein domains were annotated for 13,600 genes with PfamScan [41] (Fig. 1A).

Fig. 1
Overview of functional annotation and network construction. (A) Numbers of gene sequences and annotated genes. (B) Numbers of genes in different gene families. (C) Number of gene pairs at different PCC values. (D) Number of nodes and scale-free model fit index (R2) of the positive co-expression network at different PCC cutoffs. (E) Number of nodes and scale-free model fit index (R2) of the negative co-expression network at different PCC cutoffs. (F) Numbers of edges and nodes in the positive co-expression network, the negative co-expression network and the PPI network.

Gene family classification

First, iTAK was used to identify the transcription factors (TFs), transcription regulators (TRs) and protein kinases (PKs) in G. elata; 1273 potential TFs, 999 TRs and 274 PKs were predicted. Second, a total of 689 ubiquitin-proteasome coding genes were predicted using the hidden Markov models (HMMs) of the ubiquitin-proteasome system downloaded from the iUUCD v2.0 database. Third, all genes were aligned against the PlantEAR and CAZy databases, and 716 and 295 genes were assigned to the EAR motif-containing and CAZy families, respectively (Fig. 1B).

Co-expression network

Transcriptome samples from SRA and from our group were used to construct the co-expression network of G. elata. The expression value of each gene was calculated in each sample. We then built a gene expression matrix and calculated the Pearson correlation coefficient (PCC) between every pair of genes in G. elata. The PCC algorithm calculates the correlation between the expression of every two genes, and normalization has no effect on the correlation. The distribution of PCC values and gene pairs showed that gene pairs with high correlation were mainly concentrated in the middle part (Fig. 1C). By examining the scale-free model fit index (R2) of the co-expression networks at different PCC cutoffs, we constructed the positive and negative co-expression networks at appropriate PCC thresholds. PCC > 0.75 gave the highest R2 and was therefore taken as the best threshold for the positive co-expression network (Fig. 1D). The resulting positive co-expression network had 917,700 edges and 16,292 nodes (Fig. 1F). In contrast to the positive network, the R2 values of the negative co-expression network at PCC cutoffs of -0.65, -0.70 and -0.75 were all greater than 0.9, and node coverage was highest at PCC < -0.65 (Fig. 1E). Therefore, PCC < -0.65 was selected to construct the negative co-expression network. Finally, a negative co-expression network with 146,300 edges and 10,636 nodes was constructed (Fig. 1F).

Protein-protein interaction (PPI) network

We obtained the rice and maize PPI networks from public databases. The G. elata PPI network was constructed by mapping the rice and maize genes to G. elata based on orthologous relationships. After removing duplicate PPI pairs, a total of 53,657 PPI pairs with 5828 nodes were generated (Fig. 1F).

Construction of GelFAP v2.0

An improved platform for gene functional analysis in G. elata (GelFAP v2.0) was constructed based on functional annotation, gene family classification, and the co-expression and PPI networks. GelFAP v2.0 is organized into six sections: Home, Network, Pathway, Tools, Gene family, and Download & Help. The Network section contains the PPI and co-expression networks. The Gene family section includes CYP450, TF, TR, PK, ubiquitin, CAZy and EAR motif-containing proteins. To facilitate gene function search and analysis, seven analysis tools were embedded in GelFAP v2.0: Search, Blast, Motif Analysis, GSEA, Extract Sequence, Heatmap Analysis and JBrowse. On the search page, users can find genes of interest by entering keywords or the exact accession number of a gene, transcript or protein. The Blast tool screens for nucleic acid or protein sequences in G. elata that are similar to the entered sequences. The Motif Analysis tool searches for motifs, or tests for their enrichment, in gene promoter regions. The GSEA tool performs gene set enrichment analysis, the Extract Sequence tool extracts sequences by gene accession number or genomic location, and the Heatmap Analysis tool displays the expression data of a candidate gene list. We also integrated JBrowse into GelFAP v2.0 to visualize genomic and transcriptomic features. The Download & Help section provides download information and a user manual for GelFAP v2.0 (Fig. 2).

Fig. 2
Organizational chart of GelFAP v2.0, including Network, Gene family, Tools, Home, Pathway, Download and Help

Network display with DEGs in GelFAP v2.0

To integrate the gene co-expression and PPI networks with expression data, differentially expressed genes (DEGs) were identified from three sets of transcriptome data, yielding eight groups of DEGs. We then built a joint display of network nodes and DEGs, in which up-regulated DEGs are marked in red and down-regulated DEGs are marked in blue.
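
The joint display amounts to looking up each network node in a DEG table and assigning a colour. The Python sketch below illustrates this under stated assumptions: the function name, the use of log2 fold change as the DEG value and the neutral colour for non-DEG nodes are hypothetical and are not taken from the platform's code.

```python
def node_colors(network_genes, deg_log2fc, neutral="lightgrey"):
    """Assign a display colour to every node of a subnetwork.

    network_genes: iterable of gene IDs shown in the network
    deg_log2fc:    dict gene ID -> log2 fold change, holding only genes that
                   already passed the DEG significance filter (assumption)
    """
    colors = {}
    for gene in network_genes:
        fc = deg_log2fc.get(gene)
        if fc is None or fc == 0:
            colors[gene] = neutral  # not a DEG in this comparison
        elif fc > 0:
            colors[gene] = "red"    # up-regulated DEG
        else:
            colors[gene] = "blue"   # down-regulated DEG
    return colors


# Hypothetical gene IDs and fold changes for one DEG group.
example = node_colors(["g1", "g2", "g3"], {"g1": 2.0, "g2": -1.5})
```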

Functional application

  1. Analysis of key enzyme genes in flavonoid biosynthesis pathway.

Flavonoids are secondary metabolites that play important roles in plant growth and development [42]. Flavonoid biosynthesis is catalyzed by several key enzymes [42], including PAL (phenylalanine ammonia-lyase), C4H (trans-cinnamate 4-monooxygenase), 4CL (4-coumarate–CoA ligase) and CHS (chalcone synthase). Flavonoids are formed through eight different pathways, each leading to a different type of flavonoid compound [42]. Flavonoids have been reported in both wild and cultivated G. elata [43]. According to the KEGG annotation in GelFAP v2.0, 43 genes associated with flavonoid biosynthesis pathways were identified (Table S2). Based on the available enzyme information, we found that the key enzyme genes mainly formed the backbone of the myricetin synthesis pathway (Fig. 3A).

Fig. 3
Regulatory analysis of key enzymes in flavonoid biosynthesis pathway in G. elata. (A) Flavonoid biosynthesis pathway and its key enzyme genes. (B) Co-expression relationship between TFs and key enzyme genes in flavonoid biosynthesis. (C) Co-expression relationship within key enzymes, which can be divided into 4 modules. (D) Motif enrichment analysis results of module1. (E) Motif enrichment analysis results of module2. (F) Motif enrichment analysis results of module3. (G) Motif enrichment analysis results of module4.

To better understand the relationship between TFs and the key enzyme genes in flavonoid biosynthesis, we conducted co-expression analysis to identify TFs whose expression was correlated with that of the key enzyme genes. The results showed that MYB, HB, NAC and other TFs were co-expressed with these key enzyme genes (Fig. 3B and Table S3), suggesting that the key enzyme genes might be regulated by these TFs. We further analyzed the co-expression relationships among the key enzyme genes themselves and found four co-expression modules (Fig. 3C and Table S4). Genes in a co-expression module often share similar expression patterns and are potentially regulated by the same TFs. We therefore performed motif enrichment analysis on the genes in each module using the motif analysis tool in our platform, and found that motifs of TFs such as MYB and HB were significantly enriched in the promoter regions of genes in the co-expression modules (Fig. 3D, E, F, G). These results suggest that co-expression relationships exist between the TFs and their target key enzyme genes in the flavonoid biosynthesis pathway.
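
One common way to test whether a motif is over-represented in the promoters of a gene module is a one-sided hypergeometric test. The sketch below shows that test in Python; it is an assumption for illustration and not necessarily the statistic used by the platform's motif analysis tool, which follows the method of [33, 34]. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, with_motif, module_size, hits):
    """P(X >= hits) for X ~ Hypergeometric.

    population:  number of genes with a promoter in the genome
    with_motif:  genes in the genome whose promoter carries the motif
    module_size: number of genes in the co-expression module
    hits:        module genes whose promoter carries the motif
    """
    upper = min(with_motif, module_size)
    total = comb(population, module_size)
    tail = sum(
        comb(with_motif, i) * comb(population - with_motif, module_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total
```

A small p-value means the motif occurs in the module more often than expected by chance. When many motifs are tested, the p-values would normally be corrected for multiple testing.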

  2. Characteristic and functional analysis of C4H gene.

C4H encodes a key enzyme that catalyzes flavonoid biosynthesis [42]. To assess the characteristics of the C4H gene, we used the functional annotation, the co-expression network and the analysis tools in GelFAP v2.0 to perform a comprehensive analysis. The detail page of the C4H gene provides gene functional annotation (Fig. 4A), transcript location and sequence (Fig. 4B), links to the co-expression network (Fig. 4C), protein structure (Fig. 4D), gene family classification (Fig. 4E), KEGG annotation (Fig. 4F), GO annotation (Fig. 4G) and expression values in different samples (Fig. 4H). The functional annotation, which consists of protein functional annotation, KEGG pathway annotation and GO annotation, provides important information on gene function. The KEGG annotation showed that the gene is involved in flavonoid biosynthesis. In addition, the C4H protein contains a single CYP domain and belongs to the CYP450 family. Co-expression network analysis suggested that 11 genes were positively co-expressed with C4H (Fig. 5A) and 133 genes were negatively co-expressed with C4H (Fig. 5B). Next, gene set enrichment analysis (GSEA) was used to determine the GO terms enriched among the C4H co-expressed genes. Gene sets related to flavonoid biosynthesis were significantly enriched, such as ‘cinnamic acid biosynthetic process’ and ‘L-phenylalanine catabolic process’ (Fig. 5C). GSEA of KEGG pathways also showed significant enrichment of pathways associated with flavonoid biosynthesis (Fig. 5D).

Fig. 4
Gene detail page of C4H gene. (A) Gene functional annotation. (B) Location and transcript sequences. (C) Network of C4H. (D) Protein structure and sequence. (E) Classification of gene family. (F) KEGG annotation. (G) GO annotation. (H) Expression level in different samples.

Fig. 5
Functional analysis of C4H gene. (A) Positive co-expression network of C4H. (B) Negative co-expression network of C4H. (C) GO enrichment analysis results of C4H co-expressed genes. (D) KEGG enrichment analysis results of C4H co-expressed genes

  3. Gene expression analyses for GAFP4.

G. elata usually lives in symbiosis with fungi [44, 45], and fungi can also cause various diseases in the plant. Previous studies have shown that the GAFP4 gene has potential antifungal activity [46, 47]. Through transcriptome analysis, we found that GAFP4 was down-regulated in G. elata f. glauca compared with G. elata f. elata (Fig. 6A), and that its co-expressed genes were also significantly down-regulated in G. elata f. glauca compared with G. elata f. elata (Fig. 6B). Disease resistance is much higher in G. elata f. elata than in G. elata f. glauca [48], which is consistent with the GAFP4 expression pattern. Additionally, we found that GAFP4 expression was up-regulated by fungal disease (Fig. 6C), as was the expression of its co-expressed genes (Fig. 6D). These results are consistent with previous studies of GAFP4 gene function [46, 47].

Fig. 6
Functional analysis of GAFP4 gene. (A) Expression of GAFP4 in G. elata f. glauca and G. elata f. elata. (B) The positive co-expression network of GAFP4 with DEGs displayed for G. elata f. glauca vs. G. elata f. elata. (C) GAFP4 expression in fungal-diseased and healthy mature tubers. (D) The positive co-expression network of GAFP4 with DEGs displayed for fungal-diseased mature tubers vs. healthy mature tubers

Discussion

G. elata is an orchid with important biological properties and a completely mycoheterotrophic lifestyle in nature. Four G. elata genomes have been sequenced to date [2, 18,19,20], providing resources for studying its biochemistry, genetics, molecular biology and molecular evolution. Integrating the omics data of G. elata is therefore important for supporting scientific research. Here, we constructed an improved platform for gene function analysis of G. elata (GelFAP v2.0) by integrating a new chromosome-level genome, transcriptome data, processed annotation data and analysis tools. Compared with the first version of the platform, the current version provides better genome data, more transcriptome resources and more analysis tools, including Extract Sequence, Heatmap Analysis and JBrowse.

Flavonoids are secondary metabolites in plants and contribute to plant growth and development [42]. They are also widely used in food, medicine and health care. Flavonoids include flavones, flavanols, isoflavones, flavonols, flavanones and flavanonols [42, 49]. For a preliminary analysis of the regulatory mechanism of flavonoid biosynthesis in G. elata, we analyzed gene function and regulation using the information and tools provided in GelFAP v2.0. Our results showed that MYB, NAC and HB transcription factors might regulate flavonoid biosynthesis, as has been reported in other related plants [50,51,52,53,54]. For example, the expression of key enzyme genes is regulated by the MYB–bHLH–WDR complex, which in turn regulates the biosynthesis of flavonoids [49]. We also used the C4H and GAFP4 genes as examples to introduce the usage of this platform. One PAL, one C3’H and one E2.1.1.104 gene in flavonoid biosynthesis were directly co-expressed with the C4H gene (Fig. 3A). One F3’5’H, one E2.1.1.104 and one HCT gene in flavonoid biosynthesis were indirectly co-expressed with C4H (Fig. 3B). Motif enrichment analysis of the co-expressed genes also showed enrichment of TFs such as MYB (Fig. 3C and D). Previous studies have suggested that MYB4 can regulate the expression of the C4H gene [55, 56], which encodes a key enzyme in flavonoid biosynthesis. Our analysis may serve as a reference for future users of the platform.

To date, many platforms have been published to collect and analyze gene function information for different plant species, such as Rice TOGO Browser [57], ATTED [58], bambooNET [59], NexGenEx-Tom [60], sorghumFDB [61], MCENet [62], croFGD [63], and TeaPVs [64]. In addition, several databases cover multiple species of a particular plant family, for example MaGenDB [65] and RPGD [66]. Different platforms have different characteristics, and most of them incorporate tools for gene function comparison and analysis to meet different research needs. In GelFAP v2.0, we integrated various tools, including Search, Blast, Motif, GSEA, Extract Sequence, Heatmap Analysis and JBrowse. The Network, Gene family, KEGG and Download & Help options are also available in the menu bar for researchers to search and download information. Previously published plant gene function platforms mainly cover crops, fruits and vegetables, and few of them cover medicinal plants. Our platform is dedicated to the medicinal plant G. elata, which has rarely been covered in previous studies, and can serve as a reference for the construction of gene function platforms for other medicinal plants. At present, several gene function platforms have not been updated in a timely manner, and some websites can no longer be accessed normally. Our first version, GelFAP, was constructed in 2020. Since then, we have continuously followed research on G. elata, collected the latest genome and transcriptome data, and updated the information in GelFAP. GelFAP v2.0 was thus released within a short time and will provide researchers with the latest information for scientific research.

Although we have improved the G. elata platform, GelFAP v2.0 still has several limitations and much room for improvement. For example, we integrated only one chromosome-level genome into the platform. As new versions of the genome are released, we will continue to add the latest data to the platform. In the future, we also plan to integrate more transcriptome data and improve the tools in the platform to meet the various requirements of researchers in the field.

We believe that, with the continuous development of sequencing technology, cost reduction and long-term investment, G. elata multi-omics data will continue to accumulate. Effective and timely collection and processing of these data, together with updating of the relevant information, will help researchers carry out their projects. The website is freely available at www.gzybioinformatics.cn/Gelv2.