Background

Tea, produced from the dried leaves of tea plant, Camellia sinensis, is one of the most popular non-alcoholic beverages consumed worldwide [1]. Due to its great economic significance, tea plant had been cultivated for thousands of years, and nowadays is planted on a continent-wide scale [2]. In the past decades, tea research has focused on its numerous secondary metabolites, such as polyphenols, alkaloids, theanine, vitamins, volatile oils, and minerals, that contribute to tea quality and health effects [3,4,5,6,7]. However, many of metabolite-related genes [especially those catalyzing enzymes, regulatory transcription factors (TFs)] have not been functionally characterized. In addition to characteristic secondary components, the identification of genes related to other important agronomic traits, such as leaf yield, stress resistance, and bud development, is also significantly lagging [8,9,10], which lays an obstacle for applied genetic improvement and molecular breeding in tea plant.

The modeling and analysis of gene co-expression network has emerging as an efficient approach for the function prediction of uncharacterized genes in a focused species [11, 12]. This strategy is based on a simple assumption that functionally related genes are usually transcriptionally coordinated (co-expressed) in spatial–temporal states or across an array of environmental conditions. To date, several databases related to gene co-expression network have been actively developed for a variety of model species, such as human, Arabidopsis and rice [13,14,15,16]. It is noted that these resources are mainly established from microarray-derived transcriptome data that comes from a large amount of transcript profiling experiments in certain species. With the advent of next generation sequencing technologies (e.g. deep mRNA sequencing, RNA-Seq), the construction of a gene co-expression network and its application is now possible in non-model species with agricultural importance. By the end of 2019, more than 300 tea RNA-Seq experiments have accumulated in publicly available repositories [17]. In addition, our own lab has generated dozens of in-house RNA-Seq examples with concerned biological questions in the past several years. Therefore, RNA-Seq sample size in tea plant is now feasible for the statistical modeling of gene co-expression relationships at a genome-wide scale according to the relevant reviews [18,19,20], and a database platform can be established for screening candidate genes contributing to important traits of tea.

With the above considerations, we developed a database entitled ‘TeaCoN’ regarding a high-confidence gene co-expression network of tea plant using an optimized computational pipeline of large sample of RNA-Seq experimental data. TeaCoN implemented a user-friendly web interface that allowed users to browse, search, and download co-expression data of concerned gene(s). In addition, visualization of co-expressed genes in network paradigm and tissue expression pattern was presented. To facilitate use of TeaCoN in gene function prediction, a gene co-expression analysis, BLAST search function, genome browser, and GO and KEGG enrichment analysis were also configured. We believe TeaCoN may act as a valuable resource for novel gene identification of tea secondary metabolites and other important agronomical traits.

Construction and content

Data collection and preprocessing

We searched the database Sequence Read Archive (SRA) at National Center for Biotechnology Information (NCBI), using the keyword “Camellia sinensis”, to retrieve RNA-Seq experiments (different samples) that documented raw sequencing data of tea plant under a wide array of biological conditions. As seen in Fig. 1, describing the distribution of tea samples, the use of leave and bud is overwhelming due to their relatedness to tea infusion as direct materials. All the searched 298 RNA-Seq examples (saved as a metadata file with several important data fields) were manually checked to retain relevant records using the stringent criteria as: 1) in this study, tea plant was specifically chosen as one of the two main cultivated varieties, Camellia sinensis var. sinensis (CSS, Chinese type); thus the other lineage Camellia sinensis var. assamica (CSA, Assam type) was excluded using the data field “Organism Name”; 2) we selected RNA-Seq examples denoted as “RNA-Seq” and “Transcriptomic” in the paired data fields “Library Strategy” and “Library Source”, and discarded those RNA-Seq examples denoted as “WGS” and “Genomic” or other field pairs; 3) high-quality RNA-Seq examples were preferentially remained using the mark keyword “PolyA enriched” in the data field “Library Selection” according to the strategy proposed in [21]; and 4) deep-sequenced RNA-Seq examples were screened by choosing “> 500 M” using the data field “Total Size, M”. Using the above criteria, 288 RNA-Seq examples were retained for the following analysis. We used Aspera (version 3.7.2) to batch-download all the original SRA data of the collected RNA-Seq experiments in tea plant and used the command fastq-dump implemented in SRA Toolkit (version 2.9.1) to convert the SRA data into standard fastq format. In addition to the above publicly available RNA-Seq data in tea plant, dozens of in-house RNA-Seq experimental data (.Fasta format) in our lab were also manually checked and chosen in this study using the same criteria. For the above-pooled RNA-Seq data, clean reads were obtained from the raw sequenced reads using our in-house Python scripts by removing adaptor sequences and low quality reads, according to the method described in [22]. To facilitate the next-step implementation of genome-wide expression profile and gene co-expression network, the reference genomic data (.Fasta format) and the corresponding genomic annotation data (.GTF format) of tea plant (CSS variety) were download from our International Tea Plant Genome Sequencing Consortium [23].

Fig. 1
figure 1

An overview of tea samples used in the construction of gene co-expression network. Treated or untreated tea tissues were sampled in the original studies. The mostly-used tea tissues were leave and bud, accounting for 39 and 15%, respectively. In the treated tea tissues, fluoride and ammonium were widely used (11%). It is noted that ~ 19% of the total tea samples were not indicated as treat/untreated tissues in the original studies

Gene expression profiling at a genome-scale

To improve the read alignment efficiency, the reference genomic data of tea plant was used to build a genome index using the command hisat2-build implemented in Hisat2 (version 2.1.0) with default parameters. In the read alignments, we retained RNA-Seq samples with over 65% of reads mapping to the reference genome and, of these, at least 40% of those reads mapping to coding sequences, according to the method described in [21]. We called this as the two-round quality control of tea samples compared with the above-used criteria, and finally a total of 261 RNA-Seq samples were retained. All the clean reads of each of the above tea samples were mapped to the indexed reference genome using the command hisat2 with default parameters. The generated SAM format alignments together with the reference genome GTF annotation data were then fed to HTSeq-Count (version 0.9.1, with default parameters) and our in-house Python scripts to quantify the expression level of each of the tea gene models in different biological conditions using the three classical measures as Reads Per Kilobase Per Million Mapped Reads (RPKM), Fragments per Kilobase Million Mapped Reads (FPKM) and Transcripts Per Million (TPM). In this study, we used TMP as a standardized transcript abundance measure (that considers normalization of differences both in sequencing depth and gene length among different biological samples) to implement a gene expression profile at a genome scale that documented the expression abundance of each tea gene models in different biological conditions (saved as a mathematical matrix format). As shown in Fig. 2, nearly 60% of the total tea genes expressed on more than 90% of the total biological samples, and only 109 tea genes did not express on all the 261 biological samples, which were accordingly removed in the construction of tea gene co-expression network.

Fig. 2
figure 2

Gene coverage versus tea sample coverage using expressed gene index. The abscissa indicated the ratio of tea samples where tea gene(s) express, and the ordinate indicated the ratio of the expressed genes to the total 33,932 genes in the corresponding tea sample coverage bins

Construction of gene co-expression network

Pearson correlation coefficient (PCC-value) was used as an index to evaluate the similarity of expression profiles between every pair of tea gene models. We used the function pearsonr implemented in Python statistical function library (Scipy.stats) to calculate the corresponding statistical P-values of obtained PCC-values. A P-value less than 0.01 indicated that the expression profile correlation between a gene pair across a large number of biological samples has the statistical significance compared with random control, and thus the correlated gene pair should be retained for the co-expression network construction. In this pre-constructed network, a node denoted a gene, and a link was placed between a correlated gene pair indicating their co-expression relationship. As to the selection of a PCC-cutoff in the following network trimming, we referred to the method described by our colleagues [24, 25]. This approach considers the fact that different types of biological networks are mostly characterized as scale-free, and thus in a biological network, modular structure with high network density can be used to describe actual cellular organization [26]. In details, two network properties, modularity and high density, will be adopted as general biological network criterions for the rational selection of a cutoff in the network construction [27]. To achieve a gene co-expression network for tea plant, a range of PCC-cutoffs were considered to generate a family of gene co-expression networks. For these member networks, we estimated how well an individual network satisfies a scale-free property using the model fitting index R2 of the linear regression for the logarithmic transform of the node degree distribution [28]. Here, degree of a node represented the number of nodes linked to this node. We then applied average node degree for these individual networks to measure the network density. As indicated in Fig. 3, with the increasing PCC-cutoffs, the network density decreased whereas the scale-free model fit (R2) increased to achieve a maximum at a PCC-cutoff of 0.70 with an R2 equal to 0.87 and a moderate network density of 24.584. We chose an absolute PCC-cutoff of 0.70 to consider that two genes are significantly co-expressed, which established a compromise between the generation of a scale-free network and a high network density. We called the tea plant gene co-expression network generated using this PCC-cutoff as TeaCoN, which represented 7,347,994 co-expressed gene pairs covering 31,968 (94% coverage of the genome) tea gene models.

Fig. 3
figure 3

Network density and scale-free model fit (R2) of network based on changing PCC-cutoffs. The abscissa represented the changing PCC-cutoffs from 0.3 to 0.9. The left-blue ordinate represented network density (average node connectivity) and the right-red represented the scale-free model fit (R2) of the resulted network using a certain PCC-cutoff. It can be seen that network density decreased with the changing PCC-cutoffs, whereas the scale-free model fit (R2) increased and then decreased, reaching a maximum value of 0.87 when the PCC-cutoff is 0.7

Database implementation

TeaCoN was implemented in a free and open source Python Web framework, Django (https://www.djangoproject.com), with a popular relational database management system, MySQL (https://www.mysql.com) as the backend database. TeaCoN has a user-friendly web interface and its frontend pages are generated via HTML5, CSS3, jQuery (http://jquery.com), Bootstrap (https://getbootstrap.com), and DataTables (https://datatables.net). The gene co-expression network and expression profile heatmap of co-expressed gene were visualized by vis.js (http://visjs.org) and ECharts (http://echarts.baidu.com), respectively. The function of BLAST search, GO and KEGG enrichment analysis were established by a BLAST+ (version 2.8.1) back-end in python and a R package clusterProfiler [29], respectively, and the corresponding task queue is achieved through RabbitMQ (http://www.rabbitmq.com). We also implemented the Jbrowser (http://www.jbrowse.org) as a genome browser for gene model visualization in tea genomic location.

Utility and discussion

Web interface

TeaCoN provides a concise and user-friendly web interface that allows for the predicted tea gene co-expression associations to be clearly browsed, searched, and downloaded. In addition, gene co-expression analysis, BLAST search function, expression profile heatmap, genome browser, and GO and KEGG enrichment analysis were deployed to facilitate use of TeaCoN. To make it convenient for tea researchers to use TeaCoN’s utilities, several of the above-deployed tools, such as gene co-expression analysis and expression profile heatmap, can be directly-used in Browse and Search pages by using a several-steps button-clicking (Fig. 4).

Fig. 4
figure 4

An interactive design frame for the use of Browse, Search, and tools. In Browse and Search pages, users can logically browse and keyword-guided search tea gene co-expression information, separately (a). Upon a query, a list of genes were displayed in a tabular page (b) where a certain gene’s detailed co-expression information (e.g., gene expression boxplot and co-expression network visualization) was left-shown (c), and the analysis results (e.g., using GO, KEGG enrichment analysis) for a list of ticked genes were right-shown (d)

In tea plant, the disclosure of enzyme genes and the corresponding regulatory TF genes involved in its three major characteristic secondary metabolic pathways (theanine, caffeine, and catechins) has become an active field in the past decades [30]. Therefore, we designed a Browse page that can be viewed from the logical categories as characteristic secondary pathway, TF family, and annotated gene model. In the search page, remote users can search the database using keywords. Four search fields were deployed as follows: (1) gene ID, (2) gene symbol, (3) gene name, and (4) chromosomal location. TeaCoN has a fuzzy search engine that allows entry searching even when a queried keyword is not exact. Upon a fuzzy search, a list of records will be presented based on spelling relevance where users can manually check to find the exact one of interest. As a publicly accessible database, TeaCoN provides an easily-used download page that allows for the predicted gene co-expression data of tea plant to be fully downloaded as a whole or partly downloaded in a customized fashion using a logical selection of a certain secondary pathway, TF family and PCC-cutoff.

A gene co-expression analysis was implemented in TeaCoN to aid in detecting genes with similar expression profile across different biological conditions. A single gene or a maximum of 50 genes can be input as query gene(s) in the submitting form, with an additional PCC-cutoff that can be chosen for more high-confident gene co-expression associations. Upon a query, a gene co-expression network related to the queried gene(s) can be displayed, together with a detailed tabular information regarding the network (e.g., co-expressed gene pairs, PCC-values and P-values). It is noted that a depth selection of the co-expression search was implemented for users to get more information regarding co-expressed genes of retrieved co-expressed genes of query genes. We deployed a BLAST search function in TeaCoN to assist users to align query sequences against all the tea gene sequences archived in this database. Several parameters including program, num_descriptions, e-value and num_alignments can be chosen to conduct a user-customized sequence alignment. Upon a BLAST search, a list of relevant genes with similarities to the query sequence will be returned, as well as align scores, evalues, and useful link(s) to the details page of related gene co-expression associations.

We presented a function for users to get a view of the expression profile heatmap of co-expressed genes across seven main tea tissues, including leave, axillary bud, bud, stem, flower, ovary, root, seed and tender shoot. In this visualization, the color gradient represented the log10TMP-value of gene’s expression on each of the seven tea tissues. We implemented the JBrowse that is a dynamic web platform for genome visualization and analysis [31]. In this application, several useful tracks, such as DNA, GC skew, and GFF3 annotation, were deployed for users. Particularly, the track “RNASeq coverage” related to the RNA-Seq data used in TeaCoN can be used to show the expression of genes in their exon level across different tea samples, highlighting any possible isoform, tissue specific expression, exon skipping, and alternative splicing.

GO and KEGG enrichment analysis were also implemented to help users to determine the possible function (or pathway) that a set of genes have (or involved in). After inputting a gene list, the cutoff value of P-value, the P-value adjustment method and the Q-value cutoff can be adjusted by users. In the GO enrichment analysis, the subontologies, Biological Process, Cellular Component, and Molecular Function, should be customized. Upon a query, the IDs of the GO/KEGG terms, descriptions, genes, p-values and other information will be displayed in an ascending qvalues.

Case study

The analysis of gene co-expression data has the potentiality in dissecting genes responsible for a certain agronomic trait in tea plant. As a focus-studied characteristic metabolite, theanine is a unique non-protein amino acid in tea plant and has no reference pathway in data-rich model species, such as Arabidopsis thaliana and rice. Thus, the decoding of regulatory TF genes related to theanine biosynthesis cannot be performed using a cross-species gene knowledge translation (traditional homology-based search). To fill this vacancy, we used the genome-wide co-expression associations in tea plant and several known theanine enzyme genes (e.g., GS, GOGAT, ADC, GDH) as baits to achieve a comprehensive TF-enzyme gene co-expression network (Fig. 5). In this bipartite network, 48 gene co-expression relationships were documented between 48 TF genes and 6 theanine enzyme genes. It is noted that only four TF families were found to be associated with theanine biosynthesis, which have been reported to be related to the transcriptional regulation of plant secondary metabolism in the previous studies [32,33,34]. In the four TF families, MYB, bHLH, and WRKY families were prominently involved with a large number of members. Recently, we has reported a possible MYB-bHLH complex that regulates the accumulation of anthocyanin, another characteristic secondary metabolite abundant in tea plant [35]; so the identified TF genes may provide useful information for the elaborate transcriptional regulatory research of theanine biosynthesis.

Fig. 5
figure 5

A TF-enzyme gene co-expression network for theanine biosynthesis. In the bipartite network, circle nodes and hexagon nodes represented TF genes and theanine enzyme genes, respectively. An edge was placed between a TF gene and an enzyme gene if their expression profile relatedness exceeded the predefined PCC-cutoff used for the construction of TeaCoN. For TF genes, their TF family classifications were different-colors-indicated in their nodes representation

Discussion

Tea plant (Camellia sinensis) is one of the world’s most popular beverage crops and widely cultivated and utilized due to its economic significance [36]. To date, the decoding of genes involved in its important agronomic traits, such as characteristic components biosynthesis and stress resistance, is still significantly lagging, which restrict the applied studies in genetic improvement and molecular breeding [37, 38]. Network-assisted gene prioritizing has emerging as a powerful approach in the identification of candidate genes responsible for an agronomic trait of interest [39]. Among different forms of biological networks, gene co-expression network is widely employed using a simple and efficient strategy based on a vast amount of microarray- and/or RNA-Seq-derived transcriptome data. Currently, large samples of transcriptome data for tea plant has been generated with the popularity of RNA-Seq technology. Therefore, a good opportunity is now present for the development of a bioinformatics platform associated with a gene co-expression network that may aid in mining novel genes for wet experimental biologists in a spotlight field.

A high-confident gene co-expression associations at a genome-scale is a prerequisite in the gene identification related to a certain trait of tea plant. To this end, we adopted a multi-steps quality assessment in the network modeling, such as manual sample check, low-quality reads filtering, and network PCC-cutoff rational selection. With the gene co-expression association data available, we developed a database named ‘TeaCoN’ to present a platform for the inferred data to be browsed, searched and downloaded in a easy-to-use mode. In addition to the textual gene co-expression information, several graph displays, e.g., network visualization and expression profile heatmap of co-expressed genes, were included to present useful biological clues. To facilitate use of TeaCoN in gene identification, we also implemented a gene co-expression analysis and a BLAST search function. These two functions, together with other utilities in TeaCoN, can be used separately or cooperatively by tea researchers. For example, with the cloned sequence of tea dehydrin gene named ‘CsDHN2’ (NCBI accession: GQ228834.1), a drought-responsive gene, the homology-based BLAST search function retrieved two candidate gene models (TEA010666.1, TEA010673.1) which showed a high proximity in genomic location (see (a) and (b) in Additional file 1). We also revealed that these two candidates have the similar expression pattern across seven typical tea tissues (e.g., leaf, bud, stem, root) using the expression profile heatmap visualization (see (c) in Additional file 1). Moreover, these two genes was shown to grouped in a gene module using gene co-expression analysis in TeaCoN (see (d) in Additional file 1). These findings observed from the above combined use of TeaCoN functionalities revealed the possible functional genes related to tea drought resistance, which might provide valuable clues for tea experimental biologists in the downstream experimental design to enhance this topic in tea plant.

The TeaCoN project provides an initial groundwork for predicting and distributing a resource of genome-wide co-expression association in tea plant that aids in the function prediction of uncharacterized genes for this important beverage crop. With increasing amounts of RNA-Seq data in tea plant available, ongoing data analysis and network re-modeling using our computational pipeline will be scheduled on a three-months basis to facilitate a more comprehensive and reliable version of TeaCoN. It is noted that gene co-expression network has a relative low confidence of the gene-gene functional association (as it is inferred from a single-transcriptome-based gene expression similarity) and thus has limitations in gene function prediction. In the future, we will consider a prediction and integration of multi-level of gene functional associations, such as protein-protein interaction and transcriptional regulation, to enhance single-network of gene co-expression into multi-network of gene co-function using a statistic computational framework [40]. In addition, random walk could be a valuable algorithm with the gene co-expression network data for efficiently screening novel genes responsible for a tea trait of interest. We believe that these strategies are promising to increase the data confidence and functional availability of this database platform, and promote broader interest from researchers in tea research community.

Conclusions

In this study, a high-quality tea gene co-expression network was modeled and a related database platform named ‘TeaCoN’ was presented. TeaCoN provided detailed gene co-expression information that can be easily browsed, searched, and downloaded. Despite tea-specific utilities (e.g., gene co-expression analysis, expression profile heatmap), several commonly-used tools, including BLAST search function, GO and KEGG enrichment analysis, and genome browser were also deployed for users to conduct functional gene screening of a concerned tea trait without any bioinformatics skills. Thus, we hope that TeaCoN would be a valuable resource for future investigators in experimental biology concerned with commercial tea cultivation and characteristics.