In-depth transcriptome characterization uncovers distinct gene family expansions for Cupressus gigantea important to this long-lived species’ adaptability to environmental cues
Cupressus gigantea, a rare and endangered tree species with remarkable medicinal value, is endemic to the Tibetan Plateau. Yet, little is known about the underlying genetics of the unique ecological adaptability of this extremely long-lived conifer with a large genome size. Here, we present its first de novo and multi-tissue transcriptome in-depth characterization.
We performed Illumina paired-end sequencing and RNA libraries assembly derived from terminal buds, male and female strobili, biennial leaves, and cambium tissues taken from adult C. gigantea. In total, large-scale high-quality reads were assembled into 101,092 unigenes, with an average sequence length of 1029 bp, and 6848 unigenes (6.77%) were mapped against the KEGG databases to identify 292 pathways. A core set of 41,373 genes belonging to 2412 orthologous gene families shared between C. gigantea and nine other plants was revealed. In addition, we identified 2515 small to larger-size gene families containing in total 9223 genes specific to C. gigantea, and enriched for gene ontologies relating to biotic interactions. We identified an important terpene synthases gene family expansion with its 121 putative members.
This study presents the first comprehensive transcriptome characterization of C. gigantea. Our results will facilitate functional genomic studies to support genetic improvement and conservation programs for this endangered conifer.
Keywords: Cupressus gigantea, De novo transcriptome assembly, Ecology, Gene family evolution, Gene functional annotation
Abbreviations: LRR-RLK, leucine-rich repeat receptor-like protein kinase; USPA, universal stress protein A
Cupressus gigantea W.C. Chen et L.K. Fu, also called Tibetan cypress or giant cypress, is a rare and vulnerable conifer tree species endemic to the Tibetan Plateau. Natural populations are scarce for this species, which led C. gigantea to be listed on the Chinese National Protection List of Wild Plant (Class I). As a rare conifer, C. gigantea grows sparsely in the narrow dry valleys of the Yarlung Zangbo and Nyang Rivers on the Tibetan Plateau at an elevation band between 3000 and 3400 m. C. gigantea is an excellent timber species with high wood density, straight grain, and radial uniformity. Trees average about 20-30 m in height, with a few reaching up to 50 m; the diameter can reach up to 6 m and the age of the trees can exceed 2600 years, making C. gigantea one of the long-lived endemic cypresses in China. According to the International Union for Conservation of Nature (IUCN) Red List categories of threatened species, C. gigantea has been classified as a vulnerable species due to a highly disturbed distribution leading to serious population reduction. Therefore, based on this special status, C. gigantea not only has important timber and ornamental value, but also great scientific value in terms of ecology and conservation biology, regarding the species’ adaptation to an extreme environment due to the unique geography.
Cupressus gigantea has attracted wide interest and has been increasingly studied since it was described as a species in 1975. Previous studies focused on its geographic distribution, photosynthetic capacity, unique ecology, community characteristics, and phylogenetic status. However, these studies did not address the species’ molecular genetics, probably due to C. gigantea’s high genetic load. Only recently, Li and co-workers isolated and characterized 16 polymorphic microsatellites from C. gigantea using paired-end Illumina shotgun sequencing. Subsequently, the species’ complete chloroplast genome was determined. However, these studies did not provide any functional genetic determination of the extreme adaptive potential present within C. gigantea. The mining of genes related to adaptive mechanisms, such as those involving stress resistance superfamily genes, is indispensable to decipher the genetic underpinnings of adaptive phenotypic traits. Using this information in population-wide genetic screens has the potential to accelerate the formulation of effective conservation strategies for this vulnerable conifer species, whose distribution is restricted to the Qinghai-Tibetan Plateau.
Plants have evolved well-orchestrated resistance mechanisms to defend themselves against various environmental pressures. The expression of stress resistance superfamily genes is stimulated in order to render various protective effects to the plant under adverse environmental circumstances. Universal stress proteins (USPs) constitute a natural biological defense mechanism by providing general “stress endurance.” The USP domain contains a protein structure originally identified from Escherichia coli as USPA (universal stress protein A) because of its prominence in the stationary phase of bacterial growth. These genes function as regulators of cell survival under heat, starvation, and other biotic or abiotic stresses [14, 15, 16]. An additional group of stress molecules involves leucine-rich repeat receptor-like protein kinases (LRR-RLKs), which represent the largest group of RLKs; they mainly contain three functional domains: an extracellular domain (ECD), an intracellular kinase domain (KD), and a transmembrane (TM) domain. Previous studies indicated that LRR-RLK genes play crucial roles in meristematic growth, secondary growth, and responses to environmental stimuli, bacterial pathogens, necrotrophic fungi, and viruses [17, 18, 19, 20, 21, 22]. Furthermore, terpenoids constitute a very large and structurally diverse group of natural products and play major roles in plant defenses and stress resistance. The majority of the terpenoids found in plants so far have proven functions in plant defenses. Overall, Eucalyptus grandis seems to have the largest number (113) of putatively functional terpene synthase (TPS) genes compared to other sequenced plant genomes. Hence, research on such defense-related gene families will enable a better understanding of the diversity of defense genes in a plant species and, at the same time, gene-family phylogenetic analysis will help to infer gene functional characterizations.
In the present study, we performed de novo transcriptome assembly from terminal buds, microstrobili, female strobili, biennial foliage, and cambial tissues originating from adult C. gigantea. We assembled this transcriptome to annotate transcripts using available information in public databases, further categorized for biological functions and pathways, and characterized the diversity and evolutionary history of genes involved in plant stress responses. This represents the first comprehensive description of the global C. gigantea transcriptome to date. These new resources will contribute substantially to future functional genomic studies and conservation programs for this endangered species.
RNA-seq and de novo transcriptome assembly
In total, the paired-end sequencing yielded 153,140,282 raw read pairs. We initially evaluated the raw read base quality (Additional file 1: Figure S1), trimmed poor-quality bases, and removed all poor-quality reads with Trimmomatic (version 0.36) software with default parameter settings (Additional file 2: Figure S2). After removal of the adaptors and all low-quality sequences, the total number of clean reads amounted to 144,175,052 (94% of all initial reads). Next, de novo assembly using Trinity produced a total of 135,542 contigs (103,584,408 bp in total, with a mean length of 764 bp and a longest contig of 18,728 bp; Additional file 3: Table S1). These contigs (ordered sequences) were then joined into scaffolds, yielding 102,553 scaffolds (104,432,740 bp in total length) with a mean length of 1018 bp and a longest length of 23,390 bp. Finally, we extracted unigenes from the assembly obtained with Trinity. A total of 101,092 unigenes (104,109,640 bp) were obtained, with an average length of 1029 bp and the longest being 25,331 bp. The unigene length distribution is shown in Additional file 4: Figure S3, and the N50 score for unigenes was 1508 bp.
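The summary statistics reported above (sequence count, total length, mean length, longest sequence, and N50) can be computed from a list of sequence lengths. The following is a minimal sketch, not the script used in this study; the function name `assembly_stats` and the toy lengths are illustrative assumptions. N50 is taken here as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def assembly_stats(lengths):
    """Return (count, total_bp, mean_bp, longest_bp, n50_bp) for sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        # N50: first length at which the cumulative sum reaches half the total
        if running * 2 >= total:
            n50 = length
            break
    return len(ordered), total, total / len(ordered), ordered[0], n50


# Toy example (hypothetical lengths, not data from this study)
toy_lengths = [100, 200, 300, 400, 500]
print(assembly_stats(toy_lengths))
```

In practice the lengths would be read from the assembled FASTA file; because N50 is weighted toward long sequences, it is typically larger than the mean, as with the 1508 bp N50 and 1029 bp mean reported for the unigenes.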
Annotation and further functional classification of the gene space in C. gigantea
All 101,092 assembled unigenes were searched against Nr, KOG, GO, KEGG and Swiss-Prot protein databases using BLASTx with a 1E-5 E-value cutoff (Additional file 5: Table S2 and Additional file 6: Table S3). Functional annotation of the unigenes against these protein databases revealed a total of 33,302 (32.94% of the total) unigenes (Additional file 7: Table S4) with corresponding annotations in Nr and 24,078 (23.81%) unigenes (Additional file 8: Table S5) showed significant similarity to known proteins in the Swiss-Prot protein database. The top-scoring BLASTx hits against the Nr protein database revealed strongest similarities to Picea sitchensis (24.27%), Amborella trichopoda (9.69%), and Vitis vinifera (7.36%) genes.
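To illustrate how an E-value cutoff of 1E-5 and a top-scoring hit per unigene can be applied, the sketch below parses BLAST tabular output. It assumes the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); the output format actually used in this study is not stated, and the function name `best_hits` and the example rows are hypothetical.

```python
def best_hits(rows, max_evalue=1e-5):
    """Keep, for each query, the highest bit-score hit with E-value <= max_evalue.

    rows: iterable of tab-separated lines in the 12-column BLAST tabular layout
    (an assumption; columns 11 and 12 are E-value and bit score).
    """
    best = {}
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit fails the significance cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical rows: two significant hits for u1, one non-significant hit for u2
example_rows = [
    "u1\tP1\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-30\t200",
    "u1\tP2\t95.0\t100\t2\t0\t1\t300\t1\t100\t1e-40\t250",
    "u2\tP3\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.01\t30",
]
print(best_hits(example_rows))
```

Counting the source species of the retained subjects would then give a species distribution of top hits comparable to the one summarized above.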
Among the 24,078 unigenes (23.81% of all unigenes) with matching hits in Swiss-Prot (Additional file 8: Table S5), the ten most similar species were Arabidopsis thaliana (48.74%), Oryza sativa subsp. japonica (5.37%), Schizosaccharomyces pombe (strain 972/ATCC 24843) (3.89%), Nicotiana tabacum (3.64%), Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (3.43%), Homo sapiens (3.02%), Mus musculus (3.02%), Drosophila melanogaster (2.03%), Dictyostelium discoideum (1.49%) and Nicotiana glutinosa (1.49%).
The functional classification of GO categories was carried out with Blast2GO. A total of 28,087 unigenes (27.78% of all unigenes) matched with classifications of 3 GO functional categories: biological process (BP), cellular component (CC) and molecular function (MF) (Additional file 9: Figure S4). There were 26 subcategories for biological process, including metabolic processes (15,734, 15.56%), followed by cellular processes (15,693, 15.52%) and single-organism processes (13,502, 13.36%). CC was divided into 17 subcategories including cell part (42,520, 42.06%), followed by cell (21,262, 21.03%) and organelle (16,223, 16.05%). 18 subcategories were found under MF, including the term catalytic activity (20,356, 20.14%), followed by binding (7,878, 7.79%) and nucleic acid binding transcription factor activity (5,145, 5.09%).
A total of 16,600 unigenes (16.42%) matched entries in the KOG database  providing 18,810 functional annotations (Additional file 10: Figure S5). Among all 25 corresponding KOG categories, the largest category was signal transduction mechanisms (2,714, 2.68%), followed by general function prediction (2,292, 2.27%), posttranslational modification, protein turnover, chaperones (1,769, 1.75%), unknown function (1,123, 1.11%), carbohydrate transport and metabolism (1,040, 1.03%) and other categories with a percentage lower than 1%. The smallest category for KOG annotations was cell motility, containing only 5 unigenes as a result.
All unigenes from the C. gigantea transcriptome assembly were subjected to KAAS (KEGG Automatic Annotation Server) pathway annotation (Additional file 11: Table S6). We found 6,848 unigenes (6.77%) matching a total of 292 pathways. The most represented pathways were metabolic pathways (2,067 unigenes, 30.18%) and biosynthesis of secondary metabolites (1,483, 21.66%). These pathways provide a valuable resource for investigating specific molecular processes in C. gigantea. Furthermore, 375 unigenes were involved in environmental adaptation pathways, which comprised the following five pathways: plant-pathogen interaction (211 unigenes), circadian rhythm-plant (24), circadian rhythm-fly (8), circadian entrainment (73), and circadian rhythm (18). These pathways may be related to the control of plant physiology; for example, regulation of the circadian rhythm helps plants adapt to environmental changes. The terpenoid backbone biosynthesis pathway, with the unigenes identified in C. gigantea, is shown in Additional file 12: Figure S6.
Expansion/contraction of gene families in C. gigantea
Functional enrichment analysis of the C. gigantea gene family. Enriched GO terms: systemic acquired resistance, salicylic acid mediated signaling pathway; root meristem growth; detection of biotic stimulus; detection of external biotic stimulus; detection of bacterium; detection of other organism; regulation of meristem growth; regulation of root meristem growth; regulation of ion transmembrane transporter activity; regulation of transmembrane transporter activity; regulation of transporter activity; cyclic purine nucleotide metabolic process; chloroplast RNA modification; cyclic nucleotide biosynthetic process; cAMP biosynthetic process; regulation of plant organ morphogenesis; cell wall part; external encapsulating structure part; cAMP metabolic process.
Discovery of gene families related to resistance in C. gigantea
To explore evolutionary divergence of the USPA-like domain among sequences collected here, we performed motif analyses using the MEME program. MEME analysis identified 3 motifs in the USPA-like domain, designated M1, M2, and M3 from the N terminus to the C terminus (Fig. 2b). The alignment of the USPA-like domain with all known secondary structure elements and conserved residues is shown in Additional file 15: Figure S8. Motif M1 is shared across all groups and by almost all members of each group; motifs M2 and M3 are also shared across most groups, with the exception of the Small_Plant group. As many as 64% of the identified unigenes in C. gigantea contained all 3 motifs; of the remaining 36%, 16% contained 2 motifs and 20% only 1 motif. Seven of the eight most conserved residues (D13, V41, G127, G130, G140, S141, V142, and T143) involved in ATP binding within the USPA structure of 1MJH were identified in the USPA-like unigenes from C. gigantea.
Although the KD domain is relatively well conserved, it can be divided into 12 smaller subdomains useful for elucidating evolutionary divergence [33, 34]. In our study, 43 sequences from C. gigantea classified as LRR-RLK were analyzed with the MEME program. According to the position of the kinase domain and conserved amino acid residues, eight motifs (M1-M8) were identified (Fig. 3b and Additional file 16: Figure S9). Together they contain 11 subdomains; the X subdomain, which is the most poorly conserved subdomain and whose function is unknown, was not recovered. M2 and M6 motifs are shared across all LRR-RLK proteins identified in C. gigantea. Motifs M1, M5, M6, M7 and M8 correspond to conserved subdomains I & II, VIb & VII, VII, IX, and XI, respectively. These motifs are shared by almost all subfamilies, except that motifs M5 and M8 are not found in any members of subfamily III. Two less conserved subdomains were also found: motifs M3 and M4 correspond to subdomains V and VIa, and are shared by almost all LRR-RLK genes. In addition, motif M2 corresponds to two subdomains, the conserved subdomain III and the less conserved subdomain IV. This motif is shared by all subfamilies and all members of each subfamily.
Plant TPSs can be classified into two groups, class I and class II, with a DDxxD motif involved in divalent metal binding and considered a characteristic feature of class I TPS [40, 41]. In our study, the DDxxD motif in C. gigantea varied, occurring as DD (I/L/F/T/C) (Y/F) (D/Y/E) starting at the protein sequence position after about 480 amino acids (Additional file 17: Figure S10). Among the TPS unigenes from C. gigantea, motif DDxxD is shared by almost all subfamilies, with the exception of the TPS-c subfamily. In addition, a conserved RxR motif located upstream of the DDxxD motif was also found by using the MEME program. The RxR motif varied in the TPS-e subfamily occurring as RxK. The RxR motif produced 69 hits while the DDxxD motif identified 77 sites according to their position and conserved amino acid residues within the 121 TPS unigenes from C. gigantea, for which 50 TPS unigenes contained RxR and DDxxD motifs. Furthermore, RxK and DDxxD motifs were distributed across 16 unigenes of TPS-e subfamily in C. gigantea.
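The motif variants described above can be expressed as simple patterns for a quick scan of protein sequences. The sketch below is only an illustration using regular expressions, not the MEME analysis performed in this study: the DDxxD pattern encodes the observed DD (I/L/F/T/C) (Y/F) (D/Y/E) variants, and the RxR pattern also accepts the RxK variant reported for the TPS-e subfamily. Because a three-residue pattern such as RxR matches frequently by chance, only matches upstream of the first DDxxD site are reported. The function name and the toy sequence are hypothetical.

```python
import re

# DDxxD as observed in C. gigantea: DD (I/L/F/T/C) (Y/F) (D/Y/E)
DDXXD = re.compile(r"DD[ILFTC][YF][DYE]")
# RxR, also allowing the RxK variant described for the TPS-e subfamily
RXR = re.compile(r"R.[RK]")


def motif_sites(protein):
    """Return 1-based start positions of DDxxD sites and of upstream RxR/RxK sites."""
    seq = protein.upper()
    ddxxd = [m.start() + 1 for m in DDXXD.finditer(seq)]
    rxr = []
    if ddxxd:
        # keep only RxR/RxK matches located before the first DDxxD site
        rxr = [m.start() + 1 for m in RXR.finditer(seq) if m.start() + 1 < ddxxd[0]]
    return {"DDxxD": ddxxd, "RxR": rxr}


# Toy peptide (hypothetical, not a C. gigantea sequence)
print(motif_sites("MKRLRAAAADDIYDGG"))
```

A pattern scan of this kind is useful for a first pass over candidate sequences; statistical motif discovery such as MEME remains the appropriate tool for defining the motifs themselves.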
Validation of the presence and the potentially functional divergence among the individual gene members
Reverse transcription PCRs (RT-PCRs) successfully amplified target sequences with 35 out of 36 primer pairs designed for amplification of gene members from the USPA (primer pairs designed to amplify targets from 9 gene members), LRR-RLK (13) and TPS (14) gene families. For the only primer pair (c106946_g1_i2) that failed to amplify its target in RT-PCR, PCR with DNA as template amplified the target sequence. Successful amplification indicated the high fidelity of our transcriptome gene assembly. Further real-time quantitative reverse transcription PCR (qRT-PCR) revealed significantly different gene expression among gene members from the same family, suggestive of potential functional divergence among the gene members. See Additional file 18: Figure S11 and Additional file 19: Figure S12 for gel electrophoresis from RT-PCR and PCR and for the results of differential gene expression tests.
Cypresses are endemic trees or shrubs prone to disjunctive distributions in temperate regions throughout the northern hemisphere. Because of its isolation, C. gigantea has a narrow and scattered distribution located only in the high altitude region of the Tibetan Plateau in the southwest of China, and its population sizes are small. Due to low reproductive output, difficulty of seed germination, and the severe environmental factors associated with high elevation, C. gigantea remains endangered. Considering the urgent need for C. gigantea conservation and functional characterization of the species’ adaptive potential, its global transcriptome characterization could provide the basic genomic information for future assessment of the species’ genetic variation at the molecular level. In the present study, circa 144.18 million high quality reads were assembled into 101,092 unigenes, with an average sequence length of 1029 bp. The transcriptome data of C. gigantea were compared to those of other conifers whose genomic data were recently released (Additional file 20: Table S8).
Gene family membership may be underestimated because not all genes are expressed in the sampled tissues. However, the high quality of the transcriptome data still enabled us to discover and annotate genes associated with fundamental evolutionary processes. Our results indicated that there are 2515 gene families (containing 9223 genes) unique to the C. gigantea lineage following its divergence from the most recent common ancestor shared with any other taxon. Functional annotation of 3804 unigenes demonstrated that they are mainly enriched in functional categories involved in biological processes, including systemic acquired resistance and the salicylic acid mediated signaling pathway (GO:0009862) and the detection of biotic and external biotic stimuli (GO:0009595, GO:0098581).
Universal stress protein A (USPA) was proposed to have two domain types, with 1MJH involved in ATP-binding and 1JMVA without ATP-binding potential. Many studies have shown that the USPA-like domain is derived from a 1MJH-like ancestor in plants such as A. thaliana and Catharanthus roseus, and in animals such as urochordates as well as all Cnidaria and Lophotrochozoa. In the present study, the phylogenetic and structural analyses also indicated that the USPA domains of C. gigantea derived from a 1MJH-like ancestor. We found that all C. gigantea sequences were more closely related to the bacterial 1MJH than to 1JMVA, with strong bootstrap support in the phylogenetic tree. Moreover, the conserved motifs of the USPA-like domain in C. gigantea obtained through MEME analysis are largely consistent with the annotated structure of 1MJH, except for the alpha 3 block. The eight residues (D13, V41, G127, G130, G140, S141, V142, and T143) in the USPA structure of 1MJH needed for ATP binding were also identified in the alignment of USPA-like domains, yet V41, located in the beta 2 block of 1MJH that binds adenine, was not found in the motif. It may be that the beta 2, alpha 2, beta 3, and alpha 3 blocks are less conserved than other motifs. In addition, it is possible that such genetic variation was further shaped during the evolution of C. gigantea.
Leucine-rich repeat receptor-like protein kinase (LRR-RLK) genes were classified into 19 sub-families according to our phylogenetic analysis, which is consistent with previous studies. However, LRR-RLK genes from C. gigantea were assigned to only 7 sub-families, i.e. III, VII-1, VII-2, X, XI, XII and XIII-2. Considering that we used transcriptomes, the gene information may be incomplete because some transcripts were not expressed in our samples, and thus members of other sub-families in C. gigantea could have been missed. Twelve kinase domain subdomains have already been recognized and implicated in playing essential roles in enzyme function. In the present study, we identified 8 motifs containing 11 subdomains through MEME motif analysis. M2 and M6 motifs are shared by all LRR-RLK proteins identified in C. gigantea. These common motifs indicate functional similarities related to kinase activity. The M2 motif corresponds to subdomain III with conserved residues and subdomain IV with less conserved residues, and M6 corresponds to subdomain VII with conserved residues. Subdomains III and VII contain invariant residues that are crucial for maintaining kinase activity and peptide substrate recognition [33, 34]. Subdomain IV contains no invariant or nearly invariant residues and therefore does not appear to be directly involved in catalysis or substrate recognition.
Terpenoids are compounds widely found in plants, and they are the main components of some resins, which provide resistance to biotic challenges such as disease-causing agents. Conifers may also have a large and diverse terpene synthase (TPS) gene family given the diversity of TPS genes that have been characterized in other studied species. The 113 and 106 putative functional TPS genes identified in E. grandis and E. globulus, respectively, represent approximately four times as many as in A. thaliana (40 putative functional genes) [25, 45]. In our study, we discovered 121 such unigenes in the C. gigantea transcriptome. Phylogenetic analyses of TPS unigenes recognized eight major sub-families, designated TPS-a through TPS-g and TPS-SM. The TPS-d3 sub-family was further divided into three groups that were renamed d3–1, d3–2, d3–3 according to various terpene molecules. Furthermore, the conserved and short amino acid sequence DDxxD, implicated in catalytic function with Mg2+ binding, was also identified. Two conserved motifs in TPS, namely RxR and DDxxD, are separated by a short region of 35 amino acids. The two motifs are thought to direct the diphosphate ion away from the carbocation upon cleavage of the prenyl diphosphate substrate.
This study provides the first comprehensive transcriptome analysis of C. gigantea. In total, 101,092 unigenes with high sequence quality were obtained and were functionally classified based on BLASTx searches across multiple databases. Putative universal stress proteins (USPs), leucine-rich repeat receptor-like protein kinases (LRR-RLKs) and terpene synthase (TPS) genes found in C. gigantea are known to be involved in many different biotic and abiotic stress responses. We described a set of 2515 specific gene families containing 9223 genes for C. gigantea and demonstrated their usefulness for phylogenetic reconstruction. These unigenes and their analyses will likely form the foundation for future genetic analyses in C. gigantea, and we strongly believe that this public transcriptome database will serve as an important information platform to help us understand the genetic underpinnings of stress adaptation in C. gigantea and other closely related species. The present study demonstrates large-scale transcriptome sequencing and in-depth analyses to be a valuable means to resolve the genomics of extremophile adaptation in species with prohibitive genome size.
Materials and methods
Plant material and RNA isolation
Microstrobili (male pollen cones), female strobili, terminal buds, biennial leaves, and cambium tissues originated from five adults of C. gigantea growing in the Giant Cypress Nature Reserve (Nyingchi, Tibet, China). The plant material was collected in May 2014 (active stage of physiological activity), with permission from the local forestry bureau. All sampled tissues were immediately flash frozen in liquid nitrogen and stored at −80 °C until RNA extraction. Total RNA was isolated from each tissue using an RNeasy Plant Mini Kit (Qiagen, Hilden, Germany). In total, 25 RNA samples were prepared, representing the five tissue types from each of the five sampled trees. RNA was quantified and quality-checked for each sample; RNA was then pooled in one tube in equal quantities for RNA-Seq analysis.
cDNA library preparation and Illumina sequencing
The mRNA library was constructed according to the manufacturer’s instructions using the mRNA-Seq Sample Preparation Kit (Illumina, Inc., San Diego, CA, USA). The poly-(A) mRNA was isolated from the total RNA samples using magnetic oligo (dT) beads. To avoid priming bias, the mRNA was fragmented using an RNA fragmentation kit (Ambion, Austin, TX, USA) before cDNA synthesis. The cleaved RNA fragments were transcribed into first-strand cDNA using reverse transcriptase (Invitrogen, Carlsbad, CA, USA) and random hexamer primers, followed by second-strand cDNA synthesis using DNA polymerase I (New England BioLabs (NEB), Ipswich, MA, USA) and RNaseH (Invitrogen). Short fragments were purified with the QiaQuick PCR extraction kit and then ligated to sequencing adapters. Following agarose gel electrophoresis, fragments of 300-600 bp were selected as templates for PCR amplification. Finally, the library was sequenced using Illumina HiSeq™ 2500 and sequences were deposited in the GenBank Short Read Archive (Accession SRX2996533).
De novo transcriptome assembly
Raw reads were filtered to obtain high-quality clean reads by removing adaptor sequences, duplicated sequences and ambiguous reads (reads with unknown nucleotides “N” > 5%) using Trimmomatic (version 0.36) software with default parameter settings; bases with Phred score < 20 were trimmed. Based on the quality check, the last two base pairs from each read were removed in order to minimize the overall sequencing error. Additional file 1: Figure S1 and Additional file 2: Figure S2 show the quality assessment using FastQC. Subsequently, de novo assembly of the transcriptome was carried out with the short read assembly program Trinity using default parameters.
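The filtering criteria described above can be illustrated for a single read. The sketch below is not Trimmomatic's algorithm; it is a simplified stand-in that clips the last two bases, trims trailing bases with Phred score < 20, and discards reads in which more than 5% of the remaining bases are “N”. The Phred+33 quality encoding, the order of the steps, and the function name `clean_read` are assumptions made for illustration.

```python
def clean_read(seq, qual, min_phred=20, max_n_fraction=0.05, clip_tail=2, offset=33):
    """Return (seq, qual) after filtering, or None if the read is discarded.

    Assumes Phred+33 encoded quality strings (an assumption, not stated in the text).
    """
    # 1. Clip a fixed number of bases from the 3' end
    if clip_tail:
        seq, qual = seq[:-clip_tail], qual[:-clip_tail]
    # 2. Trim trailing bases whose Phred score is below the threshold
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    seq, qual = seq[:end], qual[:end]
    if not seq:
        return None  # nothing left after trimming
    # 3. Discard reads with too many ambiguous bases
    if seq.upper().count("N") / len(seq) > max_n_fraction:
        return None
    return seq, qual


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33
print(clean_read("ACGTACGTAC", "IIIIII##II"))
print(clean_read("NNNNACGTAC", "IIIIIIIIII"))
```

Dedicated tools additionally handle adaptor removal and keep read pairs synchronized, which this single-read sketch does not attempt.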
The unigenes generated by Trinity were annotated using the Non-Redundant protein database from NCBI (Nr), KOG of the Cluster of Orthologous Groups for eukaryotic complete genomes (COG) database, and the Gene Ontology (GO) protein database, with a cut-off E-value of 1.00E-5. For Nr annotation, we used the Blast2GO program (version 3.1) to obtain the GO annotation of unigenes. After obtaining the GO annotation for each unigene, we used the WEGO software to perform GO functional classification for all unigenes. Unigenes were assigned to metabolic pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) using a Blastall search against the KEGG database.
Gene expansion test
Protein coding sequences from the C. gigantea transcriptome and nine other plants, including one lycophyte species (Selaginella moellendorffii), one Bryopsida species (Physcomitrella patens), two gymnosperm species (Pinus taeda and Picea abies) and five angiosperm species (Amborella trichopoda, Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa), were obtained from the PLAZA database (http://bioinformatics.psb.ugent.be/plaza/). Alternatively spliced and redundant sequences were removed and only the longest isoforms were retained. Filtered sequences were first grouped with all-by-all comparisons using blastp, and significant hits (E-value < 1E-5) were clustered into gene families with the Markov cluster algorithm (MCL) in the OrthoMCL package (version 2.0.9). The coding sequences for each gene were individually retrieved and aligned using MAFFT (version 7.335). The alignments were further concatenated to construct a gene tree for each plant species. We finally constructed the phylogenetic tree between the cypress tree and nine other plant species using the RAxML package (version 8.1.24) under the following parameter settings: “-f a -x 12345 -p 12345 -# 100 -m PROTGAMMAILGX -T 4”. To track the phylogenetic history of gene families and identify expansions, we used the program CAFE (version 3.1) following the parsimony rule to reconstruct ancestral states. Functional annotation of specific gene families in C. gigantea was performed using the Trinotate program. Further Gene Ontology (GO) enrichment analysis was conducted with the clusterProfiler 3.8.1 package (http://bioconductor.org/packages/clusterProfiler/) with false discovery rate (FDR) corrections.
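The isoform filtering step described above (retaining only the longest isoform per gene) can be sketched as follows. This is an illustrative example rather than the script used in this study. It assumes Trinity-style identifiers of the form `c106946_g1_i2`, in which the part before the final `_i` suffix identifies the gene; the function name and the toy sequences are hypothetical.

```python
def longest_isoforms(transcripts):
    """Keep the longest isoform per gene.

    transcripts: dict mapping transcript ID -> sequence, with IDs assumed to
    follow the pattern <gene>_i<isoform number>, e.g. c106946_g1_i2.
    """
    best = {}
    for transcript_id, seq in transcripts.items():
        gene_id = transcript_id.rsplit("_i", 1)[0]  # strip the isoform suffix
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (transcript_id, seq)
    return {transcript_id: seq for transcript_id, seq in best.values()}


# Toy input (hypothetical sequences)
toy = {"c1_g1_i1": "ATG", "c1_g1_i2": "ATGAAA", "c2_g1_i1": "AT"}
print(longest_isoforms(toy))
```

Reducing each gene to a single representative before the all-by-all comparison prevents isoforms of the same gene from being counted as separate members of a gene family.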
Gene families and protein structure analysis
Unigenes with universal stress protein USPA-like domains were identified within the C. gigantea transcriptome by querying the PLAZA database (https://bioinformatics.psb.ugent.be/plaza/versions/gymno-plaza/) with the online analytical tool TRAPID (http://bioinformatics.psb.ugent.be/webtools/trapid/). For comparison, the USPA-like sequences were also retrieved from Arabidopsis thaliana by mining the PLAZA database. To track the evolutionary ancestor of the USPA sequences from C. gigantea, a collection of bacterial USPA proteins was obtained from a previous study containing two proposed crystal structures of USPs, one from Methanococcus jannaschii (1MJH) with bound ATP and the other from Haemophilus influenzae (1JMV) without bound ATP. Putative USP sequences were examined using the CDD (Conserved Domain Database) (https://www.ncbi.nlm.nih.gov/cdd) and Pfam databases (PF00582) (http://pfam.xfam.org/) to further verify the presence of conserved USPA-like domains. Identical or defective sequences were identified and eliminated by manual inspection in BioEdit. Unigenes with a USPA-like domain from C. gigantea, Arabidopsis thaliana and bacteria were retrieved and analyzed.
Putative LRR-RLK unigenes of C. gigantea were identified within the C. gigantea transcriptome assembly using TRAPID (see above). For all obtained LRR-RLK unigenes, we employed CDD (https://www.ncbi.nlm.nih.gov/cdd) to confirm the presence of ECD and KD domains, and TM domains were predicted by querying the TMHMM website (http://www.cbs.dtu.dk/services/TMHMM/) with default parameters of version 2.0. Unigenes not belonging to the LRR-RLK family were rejected. In addition, more than 200 LRR-RLK genes were retrieved from previous studies in which LRR-RLK members were identified in the whole genome sequences of Arabidopsis thaliana and Amborella trichopoda. A. thaliana and A. trichopoda were chosen as representatives because high-quality genomic and functional resources are available for both, and because A. trichopoda, a basal angiosperm, is a good reference for the present study on a conifer.
We identified unigenes in the C. gigantea transcriptome that showed significant similarities to known terpene synthase (TPS) genes, again using TRAPID. In addition, TPS family members were identified for Pinus taeda, Ginkgo biloba and Selaginella moellendorffii, for which whole genome sequences are available, by mining the PLAZA database (http://bioinformatics.psb.ugent.be/plaza/). TPS members from five other conifers (Platycladus orientalis, Abies grandis, Picea abies, Picea sitchensis, Taxus brevifolia) and one angiosperm (Eucalyptus grandis) were also retrieved and analyzed. A preliminary list of hits was created and redundancies were removed. All of the obtained TPS sequences were retained and examined by querying the CDD database at NCBI.
Multiple sequence alignments were conducted separately for the amino acid sequences of the USP, LRR-RLK and TPS families using MAFFT version 7.335 with default settings. The aligned sequences were visualized and manually refined with Jalview version 2.0. Alignments were further filtered using trimAl (version 1.3) with the gappyout method. Maximum likelihood trees were constructed with PhyML version 3.0 using the JTT amino acid substitution model, and branch support was estimated with approximate likelihood-ratio tests and 1000 bootstrap replicates. Phylogenetic trees were visualized and annotated using FigTree v1.4.2. In addition, we identified conserved motifs for C. gigantea unigenes with Multiple Expectation Maximization for Motif Elicitation (MEME) v.4.11.3.
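The align, trim and tree-building sequence described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the exact commands run in this study: the file names are hypothetical, the executable names (mafft, trimal, phyml) depend on the local installation, and only bootstrap support is requested here. The script only builds the command strings and does not execute them.

```python
import shlex


def build_commands(fasta, prefix, bootstrap=1000):
    """Return shell command strings for alignment, trimming and tree inference.

    File names derived from `prefix` are hypothetical placeholders.
    """
    aligned = prefix + ".aln.fasta"
    trimmed = prefix + ".trim.phy"
    return [
        # MAFFT with default settings writes the alignment to stdout.
        shlex.join(["mafft", fasta]) + " > " + shlex.quote(aligned),
        # trimAl: gappyout trimming, PHYLIP output for the tree builder.
        shlex.join(["trimal", "-in", aligned, "-out", trimmed,
                    "-gappyout", "-phylip"]),
        # PhyML: amino acid data, JTT model, non-parametric bootstrap.
        shlex.join(["phyml", "-i", trimmed, "-d", "aa", "-m", "JTT",
                    "-b", str(bootstrap)]),
    ]


commands = build_commands("tps_proteins.fasta", "tps")
for command in commands:
    print(command)
```

Keeping the three steps as explicit command strings makes it easy to log exactly which options were used for each gene family. Long sequence identifiers may need shortening before PHYLIP export, depending on the tool versions installed.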
Real-time quantitative reverse transcription PCR (qRT-PCR)
To validate the presence of individual members of the gene families reconstructed here, and to assess potential functional divergence among them, qRT-PCR was performed. Primers were designed with Primer Premier 5.0 software (available from frodo.wi.mit.edu/cgi-bin/primer5/primer5_www.cgi). In total, 35 primer pairs were successfully designed for amplification of 35 genes from 15 groups/subfamilies of the three gene families (USPA, LRR-RLK and TPS). Primer sequences are provided in Additional file 21: Table S9. qRT-PCR was conducted on a LightCycler® 96 thermocycler (Roche, Mannheim, Germany) using SYBR Premix Ex Taq (TaKaRa, Toyoto, Japan). Reactions were prepared in a total volume of 20 μl (containing 1 μl of template, 10 μl of 2 × SYBR Premix, 0.8 μl of each specific primer and 8.4 μl of ddH2O). The cycling conditions were as follows: 5 min at 95 °C, followed by 40 cycles of 95 °C for 15 s, 60 °C for 20 s and 72 °C for 15 s. Baseline and threshold cycles (Ct) were automatically determined using the LightCycler® 96 software version SW 1.1 (Roche, Mannheim, Germany). Relative gene expression with respect to the internal reference gene, Actin 7, was determined with the 2^(-ΔCt) method (ΔCt = Ct of the target - Ct of the reference). The Kruskal-Wallis H test, implemented with the ‘kruskal’ function from the R package ‘agricolae’, was used to test the significance of differences in gene expression among groups/subfamilies. Before qRT-PCR, reverse transcription polymerase chain reaction (RT-PCR) and PCR, with RNA and DNA as template, respectively, were carried out to validate the presence of the assembled gene members. RNA and DNA from leaves of two individual trees were used as templates for these RT-PCR and PCR amplifications. Each amplification was run with 2–3 replicates.
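As an illustration of the expression calculation, the following Python sketch computes relative expression as 2^(-ΔCt) and the Kruskal-Wallis H statistic for a set of groups. It is a simplified stand-in for the R function used in the study: it returns only the H statistic (average ranks for ties, no tie correction, no p-value), and the Ct values and group names are invented for demonstration.

```python
def relative_expression(ct_target, ct_reference):
    """Relative expression 2^(-dCt), with dCt = Ct(target) - Ct(reference)."""
    return 2.0 ** -(ct_target - ct_reference)


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups of numeric values.

    Tied values receive their average rank; no tie correction is applied.
    """
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; store their mean.
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    rank_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    return 12.0 / (n * (n + 1)) * rank_term - 3.0 * (n + 1)


# Hypothetical (target Ct, reference Ct) pairs for two subfamilies.
ct_pairs = {
    "group_a": [(24.0, 22.0), (25.0, 22.0), (26.0, 22.0)],
    "group_b": [(21.0, 22.0), (20.0, 22.0), (19.0, 22.0)],
}
expression = {
    name: [relative_expression(target, ref) for target, ref in pairs]
    for name, pairs in ct_pairs.items()
}
h_statistic = kruskal_wallis_h(list(expression.values()))
print(expression["group_a"][0], round(h_statistic, 3))
```

A target that crosses the threshold two cycles later than the reference (ΔCt = 2) has a relative expression of 0.25. Because the test is rank-based, the H statistic is the same whether it is computed on ΔCt values or on the transformed expression values, apart from the direction of the ranking.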
We thank Prof. Xiao-Ru Wang and Prof. Xiang-Yang Kang for their valuable suggestions.
This study was funded by the Fundamental Research Funds for the Central Universities (No. 2018BLCB08 and YX2013–41).
Availability of data and materials
The datasets analyzed during the current study are available in the NCBI Sequence Read Archive (accession SRX2996533).
JFM, IP and YAE conceived and designed the study; SSZ, ZX, HL, XGH, SQJ, KHJ, QG, JX, YQJ, and WZ collected the samples and conducted the experiments; SSZ, ZX, HL, XGH, JFM, IP and YAE wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1.State Forestry Administration, People's Republic of China. List of national key protected wild plants (first batch), vol. 2. Beijing: Beijing Science Press; 1999.
- 2.Fu LG, Yu FY, Farjon A. Cupressaceae. In: Wu ZY, Raven RH, editors. Flora of China. Beijing: Science Press; 1999. p. 62–77.
- 3.Fu LG, Chin CM. China plant red data book. Beijing: Science Press; 1992.
- 4.Zheng WJ, Fu LG, Cheng J. Chinese gymnosperm. J Syst Evol. 1975;13:56–90.
- 6.Lan XZ, Liao ZH, Wang JS. The diurnal course of photosynthesis of the endangered species Tibetan Cupressus gigantea in Tibet plateau. Acta Ecol Sin. 2005;25:3172–5.
- 7.Zhang G. Studies on the biological and ecological characteristics of endangered plant Cupressus gigantea in Tibet. For Sci Technol. 2006;2:1–5.
- 8.Wang PP, Peng PH, Mao LF. Researches on natural regeneration and community characteristics of Cupressus gigantea (Cupressaceae) population. J Sichuan Forestry Sci Technol. 2010;31:77–80.
- 9.Mao K, Ruhsam M, Ma Y, Graham SW, Liu J, Thomas P, Milne RI, Hollingsworth PM. A transcriptome-based resolution for a key taxonomic controversy in Cupressaceae. Ann Bot. 2018. https://doi.org/10.1093/aob/mcy152.
- 23.Mbaveng AT, Hamm R, Kuete V. 19 - Harmful and protective effects of terpenoids from African medicinal plants. In: Toxicological Survey of African Medicinal Plants. Elsevier; 2014. https://doi.org/10.1016/B978-0-12-800018-2.00019-4.
- 36.Dudareva N, Martin D, Kish CM, Kolosova N, Gorenstein N, Fäldt J, Miller B, Bohlmann J. (E)-β-ocimene and myrcene synthase genes of floral scent biosynthesis in snapdragon: function and expression of three terpene synthase genes of a new terpene synthase subfamily. Plant Cell. 2003;15:1227–41.
- 43.Liu P, Xie L, Li P, Mao J, Liu H, Gao S, Shi P, Gong J. Duplication and divergence of leucine-rich repeat receptor-like protein kinase (LRR-RLK) genes in basal angiosperm Amborella trichopoda. Front Plant Sci. 2016. https://doi.org/10.3389/fpls.2016.01952.
- 47.Andrews S. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc. Accessed 4 Oct 2018.
- 61.Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–8.
- 66.Rambaut A. FigTree v1.4.2: Tree figure drawing tool. 2015. http://tree.bio.ed.ac.uk/software/figtree. Accessed 9 July 2014.
- 69.De Mendiburu F. Agricolae: statistical procedures for agricultural research. R Package Version. 2014;1:1.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.