Background

Real-time quantitative PCR (RT-qPCR) is the most widely used technique for relative gene quantification because of its good repeatability, high sensitivity, strong specificity, high throughput, simplicity, speed, and low cost [1,2,3]. However, the biological variability of initial materials and the technical factors involved in sample preparation, such as the quantity of cDNA, RNA extraction, RNA integrity, and storage conditions, will inevitably affect the accuracy of RT-qPCR [4,5,6]. Therefore, normalization is necessary to correct for variations in template quantity. Reference genes are used for the normalization of gene expression because their expression levels are stable among different tissues, different developmental stages, or under various treatments [5].

In general, constitutively expressed housekeeping genes are used as reference genes, such as actin, elongation factor, glyceraldehyde-3-phosphate dehydrogenase, ribosomal RNA, translation initiation factor, tubulin, and ubiquitin [7,8,9]. However, many studies have shown that some of these genes are not always stable and that their expression levels vary greatly under specific experimental conditions [6, 10, 11]. This is especially true for non-model organisms, which currently lag behind well-characterized model organisms in terms of genomic resources and empirically tested reference genes [4, 6, 10, 11]. Moreover, recent studies have shown that it is impossible to fully normalize gene expression data from all sample types using a single gene [6, 10]. Therefore, two or more reference genes are desirable to improve the reliability and accuracy of the RT-qPCR results. With the increasing application of high-throughput sequencing, RNA-seq has provided a new strategy for identifying new highly stable reference genes from transcriptome data. To date, many novel reference genes have been identified from transcriptome data in various organisms [10,11,12,13,14,15,16,17,18,19,20,21,22].

The Echiura worm Urechis unicinctus, a typical benthic species living in intertidal sediments, is widely distributed in Russia, the Korean Peninsula, Japan, and China [23]. U. unicinctus has high economic value because of its edible and potential medical value [23]. U. unicinctus is also widely used to study gametogenesis [24], development [25,26,27,28,29,30,31], evolution [32, 33], and sulfide metabolism [34] because of characteristics such as the large number of eggs laid, high fertilization rate, biphasic life cycle, and high sulfide tolerance. Recently, the use of U. unicinctus in evo-devo studies has generated many breakthroughs [32, 33]. Evolutionary transcriptome analysis of the trochophores of U. unicinctus and other metazoan animals reveals an adult-first evolutionary scenario with a single metazoan larval intercalation [32]. Hox-mediated body plan diversification is an important developmental process [35]. In U. unicinctus, the expression of Hox genes exhibits a subcluster-based whole-cluster spatio-temporal collinearity pattern, suggesting that Hox subclusters play an important role in the spatio-temporal collinearity pattern in invertebrates [33]. On the other hand, as a species living in the intertidal zone, U. unicinctus can tolerate, metabolize and utilize environmental sulfide and is considered a model species for sulfide adaptation [34, 36,37,38,39,40,41,42,43,44,45].

At present, gene expression analysis by RT-qPCR has been widely performed in U. unicinctus, using commonly used reference genes, such as ATPase [29,30,31, 33, 46] and β-actin [34, 36, 40, 41, 43, 47,48,49,50,51]. Previous studies have identified some reference genes, but have generally focused on some traditionally used genes, such as EF-1-α, TBP, TUB, eIF3, and ATPase [52, 53]. Suitable reference genes are crucial for verifying the expression profiles of related genes for future studies in U. unicinctus. RNA-Seq, which can provide a large amount of gene transcription information, is a better method for reference gene screening [10,11,12,13,14,15,16,17,18,19,20,21,22]. In addition to classical housekeeping genes, transcriptome data analysis provides an opportunity to identify novel and more stable reference genes. In recent years, a large amount of transcriptome data of U. unicinctus has been published [32, 45, 46, 54], which provides a new strategy for selecting housekeeping genes or reference genes in U. unicinctus.

In this study, we systematically screened reference genes by analyzing transcriptome data including early development, normal adult tissues, and the hindgut under sulfide stress in U. unicinctus. Candidate reference genes were selected from the three datasets. Moreover, the correlation between the Ct of RT-qPCR and transcripts per million (TPM) of the transcriptome data was investigated. Our findings identified novel stable reference genes from transcriptome data and contributed to the accurate quantification of gene expression in U. unicinctus.

Results and discussion

Identification of the candidate reference genes from transcriptome data

In this study, we systematically screened reference genes based on transcriptome datasets from early developmental embryos and larvae, normal adult tissues, and the hindgut under sulfide stress using the coefficient of variation (CV) method in U. unicinctus. The CV method is simple to use for candidate reference gene selection. Moreover, compared with other methods such as the fold change method, the CV method quantifies expression variability in a way that allows genes to be ranked and directly compared. It has previously been used to identify novel reference genes from transcriptome data in plant species such as the monkeyflower Mimulus luteus, Polygonum cuspidatum, apple, and Lycium barbarum L. [12, 16, 21, 55] and in animals such as Mizuhopecten yessoensis and the silkworm Bombyx mori [17, 18]. Although both reads/fragments per kilobase per million (RPKM or FPKM) and TPM can be used to measure gene expression levels, RPKM and FPKM may not be suitable for comparing gene expression levels between samples because of differences in sequencing depth. TPM is more suitable for comparison of expression levels among samples [11, 56]. First, so that candidates would be easy to detect in RT-qPCR assays, we excluded genes with low expression levels by adopting a minimum mean log2(TPM) cut-off of 5. Second, to ensure that the reference genes had low variance, a standard deviation (SD) of log2(TPM) of less than 1 was required. Third, a CV cut-off of 0.2, as recommended in previous studies [11], was applied to further identify reference genes. Based on these criteria, we identified 6834 (15.82%), 7110 (16.85%) and 13880 (35.87%) candidate reference genes from 43209 genes of early developmental embryos and larvae, 42191 genes of different normal adult tissues, and 38690 genes of the hindgut under sulfide stress, respectively. The number of candidate reference genes for early development was the lowest.
This result was expected because gene expression levels can change dramatically in a short time during early development [11]. Further, the expression levels of the candidate reference genes were analyzed. The results indicated that the median log2(TPM) values of the candidate reference genes were 14.162 in the early developmental stages, 15.389 in normal adult tissues and 16.317 in the hindgut under sulfide stress (Fig. 1A). The ten most stable genes with the lowest CV values in early developmental embryos and larvae, different normal adult tissues, and the hindgut under sulfide stress are listed in Table 1. The mean log2(TPM) and CV values of the ten most stable genes in the early developmental stages ranged from 14.317 to 14.329 and 0.0157 to 0.0203, respectively. All genes were annotated, of which six encoded proteins (FXRD1, IF2M, STX12, PTCD3, TCF25, and TFG) related to gene transcription, translation, protein transport and assembly, and four encoded proteins (CAPR1, NSMA, PDCD6, and HBS1L) related to cell growth, proliferation and apoptosis. As for the ten most stable genes in normal adult tissues, their mean log2(TPM) and CV values ranged from 15.524 to 15.532 and 0.0093 to 0.0138, respectively. Nine genes were annotated, of which six encoded proteins (UBE4A, OGT1, GGA3, EHMT1, TRA2B, and PRP39) related to protein modification, transport and mRNA splicing. The ten most stable genes in the hindgut under sulfide stress showed mean log2(TPM) values ranging from 16.3457 to 16.3463 and CV values ranging from 0.0018 to 0.0032. Nine genes were annotated, of which three (CNOT1, RBM26, and HTR5B) encoded proteins related to post-transcriptional regulation.

Fig. 1

The selected candidate reference genes in early development, normal adult tissues, and the hindgut under sulfide stress. A Boxplot exhibiting the log2TPM values of candidate reference genes that met the criteria. B Venn diagram showing the relationships between candidate reference genes under three conditions. Pink, early development; blue, normal adult tissues; green, hindgut under sulfide stress

Table 1 Information on the top 10 candidate reference genes in early development, normal adult tissues, and the hindgut under sulfide stress of U. unicinctus

Functional enrichment analysis of candidate reference genes

To further analyze the relationships of candidate reference genes in early development, normal adult tissues, and the hindgut under sulfide stress, we compared the three candidate reference gene datasets. As shown in Fig. 1B, 4079 genes were shared in three candidate reference gene datasets, 4184 genes were shared in early development and normal adult tissues, 5790 genes were shared in early development and hindgut under sulfide stress, and 6615 genes were shared in normal adult tissues and hindgut under sulfide stress. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses of the three candidate reference gene datasets were then performed (Table 2). In early development, GO enrichment analysis showed that the candidate reference genes were mainly enriched in biological process (BP) terms associated with cellular protein metabolic process and macromolecule metabolic process, and in molecular function (MF) terms related to binding. These genes were also enriched in cellular component (CC) terms associated with intracellular and cell. KEGG pathway enrichment analysis indicated ribosome was the most significant pathway, followed by proteasome. In different adult tissues, GO enrichment analysis showed that the candidate reference genes were mainly enriched in biological process (BP) terms associated with cellular protein metabolic process and intracellular transport, and in molecular function (MF) terms related to binding. These genes were also enriched in cellular component (CC) terms associated with intracellular and cell. KEGG pathway enrichment analysis indicated ribosome was the most significant pathway, followed by oxidative phosphorylation. In the hindgut under sulfide stress, GO enrichment analysis showed that the candidate reference genes were mainly enriched in biological process (BP) terms associated with protein phosphorylation and in molecular function (MF) terms related to protein binding. 
These genes were also enriched in cellular component (CC) terms associated with intracellular, cell and organelle. KEGG pathway enrichment analysis revealed that diabetic cardiomyopathy was the most significant pathway, followed by oxidative phosphorylation. In summary, the results of GO and KEGG enrichment analysis in the hindgut under sulfide stress were different from those in early development and different normal adult tissues, and the proportion of specific candidate reference genes was largest in the hindgut under sulfide stress. This suggests that the hindgut under sulfide stress, early developmental embryos and larvae, and different adult tissues may focus on diverse biological processes or pathways. Therefore, optimal reference genes need to be screened separately for early development, different normal adult tissues, and the hindgut under sulfide stress.
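The comparison of the three candidate gene lists in Fig. 1B amounts to a set of intersections. The following is a minimal sketch of that bookkeeping, not the authors' procedure. The function name `overlap_summary` and the gene IDs are hypothetical, and the pairwise intersections computed here include genes shared by all three lists, which is an assumption about how the counts in the figure are defined.

```python
def overlap_summary(early, adult, sulfide):
    """Return the shared and condition-specific gene sets for three lists."""
    early, adult, sulfide = set(early), set(adult), set(sulfide)
    return {
        # Genes present in all three candidate lists.
        "all_three": early & adult & sulfide,
        # Pairwise intersections (these also contain the all_three genes).
        "early_adult": early & adult,
        "early_sulfide": early & sulfide,
        "adult_sulfide": adult & sulfide,
        # Genes found only in the sulfide-stress list.
        "sulfide_only": sulfide - early - adult,
    }


# Hypothetical gene IDs, for illustration only.
toy_early = ["g1", "g2", "g3"]
toy_adult = ["g1", "g2", "g4"]
toy_sulfide = ["g1", "g3", "g4", "g5"]

summary = overlap_summary(toy_early, toy_adult, toy_sulfide)
for name, genes in summary.items():
    print(name, len(genes), sorted(genes))
```

Using sets makes each comparison a single operation and removes duplicate IDs before counting.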

Table 2 GO and KEGG pathways enriched in the candidate reference genes

Validation of the expression stability of candidate and commonly used reference genes by RT-qPCR assay

Five candidates and five commonly used reference genes were selected for validation and comparison of the expression stability in this study (Additional file 1: Table S1). Five of the top ten candidate reference genes were chosen for RT-qPCR in each case, as shown in Table 1. The five candidate reference genes were FXRD1, CAPR1, NSMA, IF2M, and STX12 in early development; UBE4A, OGT1, EHMT1, GGA3, and TRA2B in normal adult tissues and CNOT1, CLH1, EXOC6, LYAG, and TRFM in the hindgut under sulfide stress. The five commonly used reference genes were ATPase B, TBP, eIF3, ACTB, and GAPDH (Table 3).

Table 3 Detailed information on the five commonly used reference genes in early development, normal adult tissues, and the hindgut under sulfide stress of U. unicinctus

Boxplots were constructed to present the expression levels of the five candidate reference genes and five commonly used reference genes under all three conditions (Fig. 2). As shown in Fig. 2, Table 1, and Table 3, the novel candidate reference genes were more stable than the commonly used reference genes in all three conditions. The variances of the commonly used reference genes differed among the three cases. Three of the five commonly used reference genes, ACTB, GAPDH, and TBP, were unstable during early development. ACTB was the most unstable reference gene in early development and normal adult tissues. GAPDH had the highest variance in the hindgut under sulfide stress, followed by TBP.

Fig. 2

Boxplots showing the log2TPM values of the five commonly used reference genes and five candidate reference genes in early development (A), normal adult tissues (B), and the hindgut under sulfide stress (C)

To further examine the results of transcriptome analysis, RT-qPCR experiments were carried out, and the expression level constancy of ten reference genes in different cases was assessed by four data processing methods (geNorm, NormFinder, BestKeeper, and ∆Ct) [57,58,59,60]. Despite a slight difference in the samples used between transcriptome data and RT-qPCR, the results of transcriptome analysis and RT-qPCR assay are very similar, which suggests that candidate reference genes have higher stability than most of the commonly used genes. As shown in Fig. 3A, during early development, syntaxin-12 (STX12) was the most stable gene, which is a member of the syntaxin family localized to the endosome [61]. The syntaxin family belongs to the t-SNARE subfamily of the SNARE superfamily and is involved in vesicle trafficking [62, 63]. STX12 is widely expressed and potentially participates in a common trafficking event that occurs in every cell [64,65,66], which explains why STX12 showed stable expression levels during early development. The most stable reference gene in different adult tissues was euchromatic histone-lysine N-methyltransferase 1 (EHMT1) (Fig. 3B). EHMT1 and euchromatic histone-lysine N-methyltransferase 2 (EHMT2) are highly homologous and generate functional heterodimeric complexes that are mainly responsible for mono- and dimethylation of histone H3 lysine 9 (H3K9) in euchromatin [67]. EHMT1/EHMT2 is essential for maintaining the normal methylation patterns of H3K9 and plays a central role in the epigenetic control of euchromatin, which is vital for normal cell function. They are universally expressed and associated with many biological processes [67,68,69,70,71]. EHMT1 is also required for normal levels of DNA methylation in facultative heterochromatin [72]. The stable expression levels of EHMT1 and its key role in cells enabled its use as a reference gene in normal adult tissues. 
As for the hindgut under sulfide stress, the most stable gene was lysosomal alpha-glucosidase (LYAG) (Fig. 3C), which is a retaining exo-glucosidase catalyzing the production of glucose from glycogen in lysosomes [73, 74]. LYAG is extremely important for the degradation of glycogen in lysosomes [75]. A defect in LYAG can cause the substrate to accumulate in almost all body tissues [76]. Alpha-glucosidases have weak specificity and a given substrate is not strictly connected to a single type of protein [77]. The expression levels of LYAG were not affected by sulfide treatment. Therefore, STX12, EHMT1 and LYAG can be selected as reference genes to normalize the results of the RT-qPCR assay in early development, normal adult tissues, and the hindgut under sulfide stress in U. unicinctus, respectively.

Fig. 3

Expression stability of the ten genes in early development (A), normal adult tissues (B), and hindgut under sulfide stress (C). Stability was estimated using geNorm, NormFinder, BestKeeper, and ∆Ct analyses of the RT-qPCR data. The genes were ranked in descending order of expression stability from left to right after comprehensive analysis

Compared to the novel screened candidate reference genes, most of the commonly used reference genes had lower comprehensive ranking values. During early development, ACTB was the most variable, followed by TBP, which is consistent with the high CV values of the two genes in the transcriptome data (Fig. 3A, Table 3). Similarly, in different normal adult tissues, ACTB had the lowest comprehensive ranking in the RT-qPCR analysis and the highest variance, and was not included in the candidate reference gene list. TBP and ATPase B were more stable, with relatively high stability rankings by RT-qPCR compared with the other traditional reference genes (Fig. 3B). This is expected because the two genes are candidate reference genes that pass the criteria and have relatively low CV values among the five commonly used reference genes (Table 3). The results of the comprehensive analysis in early development and normal adult tissues are in accordance with those of previous studies in U. unicinctus [52]. During sulfide stress in the hindgut, ACTB was the most unstable in the comprehensive RT-qPCR results (Fig. 3C). However, there was also some inconsistency between the results of RT-qPCR and transcriptome data analysis. For example, GAPDH, which ranked second after STX12 in the RT-qPCR results during early development (Fig. 3A), was not included in the candidate reference gene list because its SD of log2(TPM) was 1.23 in the transcriptome data analysis. We infer that this inconsistency may result from the difference in the samples used for the transcriptome data and for RT-qPCR.

Relationship of gene expression level between transcriptome data and RT-qPCR

Previous studies have suggested that there is a high correlation between RNA-Seq data and the Ct values of RT-qPCR [11]. Therefore, we assessed the relationship between the TPM values of the transcriptome data and the Ct values of RT-qPCR. As shown in Fig. 4, there was a significant negative correlation between log2(TPM) and Ct values (R2 = 0.0453, P < 0.0001), with the formula Ct = −0.5405 × log2(TPM) + 34.51. This formula will contribute to the estimation of the Ct value based on transcriptome data without executing the RT-qPCR assay, and will be conducive to our further research.
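As a worked illustration of how the reported formula could be applied, the sketch below converts a TPM value into an estimated Ct. The slope and intercept are the values reported above, while the function name `estimate_ct` and the example TPM value are hypothetical. Because the formula describes a trend across genes, an individual estimate should be read as a rough guide only.

```python
import math

# Coefficients of the fitted line reported in the text.
SLOPE = -0.5405
INTERCEPT = 34.51


def estimate_ct(tpm):
    """Estimate a Ct value from a TPM value using the reported linear formula."""
    if tpm <= 0:
        # log2 is undefined for zero or negative expression values.
        raise ValueError("TPM must be positive")
    return SLOPE * math.log2(tpm) + INTERCEPT


# Hypothetical gene with TPM = 1024, i.e. log2(TPM) = 10.
example_ct = estimate_ct(1024.0)
print(round(example_ct, 3))
```

For log2(TPM) = 10 the estimate is 34.51 − 5.405 = 29.105, and each doubling of TPM lowers the estimated Ct by 0.5405 cycles.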

Fig. 4

The correlation of gene expression between log2TPM of transcriptome data and Ct values of RT-qPCR

Conclusions

In this study, we identified candidate reference genes for embryos and larvae of early developmental stages, normal adult tissues, and the hindgut under sulfide stress based on transcriptome data from U. unicinctus. We then validated the candidate reference genes by RT-qPCR using four methods (geNorm, NormFinder, BestKeeper, and ∆Ct) and compared the stability of the candidate reference genes with that of commonly used reference genes. The results showed that STX12, EHMT1, and LYAG are the most stable candidate reference genes in early development, normal adult tissues, and the hindgut under sulfide stress, respectively. Our study indicates that transcriptome analysis approaches have great potential to discover novel stable reference genes and will contribute to future gene expression level research in U. unicinctus.

Materials and methods

Animal materials and treatments

Adult U. unicinctus were collected from the intertidal zone along the coast of Yantai city, China. They were maintained in aerated seawater (19℃, pH 8.0, salinity 30 PSU) and fed with Chaetoceros muelleri, Chlorella vulgaris, and Platymonas helgolandica.

We selected three healthy adult worms and dissected six tissues, including the body wall, coelomic fluid, foregut, mid-gut, hindgut and anal sac, from each individual in phosphate-buffered saline (PBS, pH 7.4). After dissection, the tissues were immediately frozen in liquid nitrogen and stored at -80℃.

Sexually mature individuals were selected and dissected to acquire mature ova and sperm from nephridia (gonoducts) during the spawning season. Sperm and ova were then mixed for artificial insemination at a ratio of 10:1. Fertilized eggs were reared in filtered seawater (FSW) (17℃, pH 7.9, salinity 30). Embryos and larvae from ten developmental stages, including early cells (EC, 2 cells, 4 cells, and 8 cells), multiple cells (MC, generally more than 32 cells), blastulae (BL), gastrulae (GA), early trochophore larvae (ET, 1 d post fertilization, dpf), mid-trochophore larvae (MT, 2 dpf), late-trochophore larvae (LT, 25 dpf), early segmentation larvae (ES, 30 dpf), segmentation larvae (SL, 35 dpf) and worm-shaped larvae (WL, 42 dpf), were collected, frozen immediately in liquid nitrogen and then stored at -80℃ for total RNA extraction. Three biological replicates were prepared for each developmental stage.

The experimental system and sulfide treatment were conducted as described previously [45]. We prepared three aquariums containing 30 L of seawater and sealed them with cling film. Six individuals were randomly selected for placement in each aquarium. The sulfide concentration in seawater was maintained at 50 μM (equivalent to moderately polluted sediment in which U. unicinctus can live normally) by adding the sulfide stock solution (10 mM Na2S, pH 8.0) at 2 h intervals, and was monitored by the methylene blue method [41]. The hindguts of three individuals from each aquarium were dissected 0 (control), 6, 24, and 48 h after sulfide treatment. The hindguts were immediately frozen in liquid nitrogen and stored at -80℃.

Transcriptome datasets

Transcriptome data of U. unicinctus embryos and larvae at various developmental periods were obtained from the NCBI Sequence Read Archive (SRA) database under the accession numbers PRJNA485379 and PRJNA394029, mainly including the following stages, EC: Early cells; MC: Multicellular; BL: Blastula; GA: Gastrula; ET: Early-trochophore; MT: Mid-trochophore; LT: Late-trochophore; SL: Segmentation larva and WL: Worm-shaped larva. The transcriptome data of the normal adult tissues were also obtained from the NCBI SRA database under the accession number PRJNA917787, which mainly included the body wall, coelomic fluid, foregut, mid-gut, hindgut, and anal sac. Transcriptome data of the sulfide stress hindgut samples (50 μM for 0, 6, 24, and 48 h) were also obtained from the NCBI SRA database under the accession number PRJNA752504.

Identification of reference genes based on transcriptome data

Reference genes for RT-qPCR were selected using the coefficient of variation (CV) method as previously described [11]. TPM values were used to measure gene expression levels and were averaged across biological replicates for subsequent analyses. First, genes with log2(TPM) values less than or equal to 5 were excluded, because such low-expression genes are difficult to detect and quantify and would lead to poor RT-qPCR results. CV values were calculated using the formula CV = standard deviation (SD) of log2(TPM) / mean of log2(TPM). Calculations for the mean, SD, and CV were implemented in Microsoft Excel. Candidate reference genes were required to have low variance, with SD values lower than 1. Finally, a CV cut-off of 0.2 was adopted, which was the cut-off for stable expression across heterogeneous genes.
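The three cut-offs described above can be expressed compactly in code. The following is a minimal sketch under stated assumptions and is not the Excel workflow used in this study. The function name `select_candidates` and the gene IDs are hypothetical, the SD is taken as the sample SD, the CV threshold is treated as strict, and genes with a TPM of zero in any sample are dropped because log2(0) is undefined.

```python
import math
import statistics


def select_candidates(tpm_by_gene, min_mean=5.0, max_sd=1.0, max_cv=0.2):
    """Return {gene: (mean, sd, cv)} for genes passing all three cut-offs.

    tpm_by_gene maps a gene ID to its per-sample TPM values (replicates
    already averaged). All statistics are computed on log2(TPM).
    """
    selected = {}
    for gene, tpms in tpm_by_gene.items():
        # Assumption: skip genes with a zero TPM, since log2(0) is undefined.
        if any(t <= 0 for t in tpms):
            continue
        logs = [math.log2(t) for t in tpms]
        mean = statistics.mean(logs)
        sd = statistics.stdev(logs)  # sample SD (assumption)
        # Cut-off 1: exclude genes with mean log2(TPM) <= 5.
        if mean <= min_mean:
            continue
        cv = sd / mean
        # Cut-offs 2 and 3: SD below 1 and CV below 0.2.
        if sd < max_sd and cv < max_cv:
            selected[gene] = (mean, sd, cv)
    return selected


# Hypothetical expression values, for illustration only.
toy_tpm = {
    "geneA": [1024.0, 2048.0, 1024.0, 2048.0],  # high and stable
    "geneB": [16.0, 32.0, 16.0, 32.0],          # stable but too low
    "geneC": [64.0, 4096.0, 64.0, 4096.0],      # high but variable
}

candidates = select_candidates(toy_tpm)
# Rank the surviving genes from lowest to highest CV.
ranked = sorted(candidates, key=lambda g: candidates[g][2])
print(ranked)
```

In this toy input only geneA passes: geneB fails the expression cut-off (mean log2(TPM) of 4.5) and geneC fails the SD cut-off.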

Functional enrichment analysis

To further understand the functions of the selected candidate reference genes, Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis were performed. The Swiss-Prot Blast results for all genes and the results of reference genes were imported into the online software OmicShare Tools (https://www.omicshare.com/tools/home/index/index.html), and the GO and KEGG enrichment analysis was completed using the Bioinformatics Cloud Tool Platform [78,79,80].

RNA isolation and cDNA synthesis

Total RNA was extracted from the stored samples using TRIzol reagent (Invitrogen, Carlsbad, CA, USA), according to the manufacturer’s instructions. The RNA quality was assessed using a NanoDrop 2000 (Thermo Scientific, Wilmington, DE, USA) and agarose gel electrophoresis. Then, the cDNA template was prepared using a PrimeScript™ RT reagent kit with gDNA Eraser (TaKaRa, Dalian, China) following the manufacturer’s instructions, and diluted with distilled water (1:10) for subsequent experiments.

Validation of the reference gene expression stability by RT-qPCR assay

Ten genes, consisting of five novel candidates and five commonly used reference genes, were chosen for RT-qPCR validation in each of early development, normal adult tissues, and the hindgut under sulfide stress. Primers were designed using Primer Premier 5.0 software and the primer sequences are listed in Additional file 1: Table S1.

RT-qPCR was performed on a LightCycler 480 system (Roche, Basel, Switzerland) using SYBR Premix Ex Taq™ (TaKaRa, Dalian, China). All reactions were carried out with three sample replicates and three technical replicates, and all RT-qPCR assays were validated in compliance with the MIQE guidelines [81].

Four statistical approaches, geNorm (https://genorm.cmgg.be/) [57], NormFinder (http://moma.dk/) [58], BestKeeper (www.gene-quantification.com/bestkeeper.html) [59], and ∆Ct method [60], were applied to estimate the expression stability of the reference genes. The final ranking of gene expression stability was determined by calculating the geometric mean values of the results acquired using the four approaches.
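The combination step described above can be sketched as follows. This is a hedged illustration and not the exact procedure used here. It assumes that each of the four approaches yields a rank per gene (1 = most stable) and that the geometric mean is taken over those ranks. The function name `comprehensive_ranking`, the gene IDs, and the ranks are hypothetical.

```python
import math


def comprehensive_ranking(ranks_by_method):
    """Order genes by the geometric mean of their per-method ranks.

    ranks_by_method maps a method name to {gene: rank}, where rank 1 is
    the most stable gene. Returns (gene, score) pairs, most stable first.
    """
    methods = list(ranks_by_method)
    genes = list(ranks_by_method[methods[0]])
    scores = {}
    for gene in genes:
        ranks = [ranks_by_method[m][gene] for m in methods]
        # Geometric mean: n-th root of the product of the n ranks.
        scores[gene] = math.prod(ranks) ** (1.0 / len(ranks))
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical ranks for three genes, for illustration only.
toy_ranks = {
    "geNorm": {"geneX": 1, "geneY": 2, "geneZ": 3},
    "NormFinder": {"geneX": 1, "geneY": 3, "geneZ": 2},
    "BestKeeper": {"geneX": 2, "geneY": 1, "geneZ": 3},
    "deltaCt": {"geneX": 1, "geneY": 2, "geneZ": 3},
}

final_order = comprehensive_ranking(toy_ranks)
for gene, score in final_order:
    print(gene, round(score, 3))
```

A lower score means a more stable gene. The geometric mean is less dominated by a single poor rank than the arithmetic mean.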

Statistical analysis

Statistical analysis of the correlation between TPM and Ct values was performed using one-way analysis of variance (one-way ANOVA). Statistical significance was set at P < 0.05.
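A fitted line and an R2 value of the kind reported in Fig. 4 can be obtained by ordinary least squares. The sketch below is an assumption about the general technique and does not reproduce the authors' software or settings. The function name `linear_fit` and the data points are hypothetical, and the significance test is omitted.

```python
def linear_fit(x, y):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 is the squared Pearson correlation.
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical, perfectly linear toy data: log2(TPM) versus Ct.
toy_log2_tpm = [6.0, 8.0, 10.0, 12.0]
toy_ct = [31.0, 30.0, 29.0, 28.0]

slope, intercept, r_squared = linear_fit(toy_log2_tpm, toy_ct)
print(slope, intercept, r_squared)
```

With these toy points the fit is exact (slope −0.5, intercept 34, R2 of 1). Real data scatter around the line and give a lower R2.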