Background

The ascomycetous yeast Komagataella phaffii (previously Pichia pastoris [1]) is routinely used for heterologous protein expression because it grows fast, is cheap to cultivate compared to mammalian cell cultures, and can secrete high levels of recombinant proteins [2]. Moreover, a large genetic toolbox is available for K. phaffii, making it an interesting workhorse for the production of recombinant proteins in academia as well as in the biopharmaceutical industry [3]. K. phaffii has also been suggested as a model organism for yeasts, as it has undergone a slower evolution than Saccharomyces cerevisiae and has retained more characteristics of ancient yeasts and of metazoan organisms [4]. Consequently, different aspects of its general biology, and especially the protein production and secretion pathway, have been studied over the years (reviewed in [3,4,5,6,7,8]). In many of these studies, reverse transcription quantitative PCR (RT-qPCR) analyses are used to determine the transcript levels of native and heterologous genes [9,10,11,12,13,14,15,16,17,18,19,20].

RT-qPCR is a routinely used method for the quantification of individual transcripts in biological samples. In brief, the RNA is first extracted and then reverse transcribed, yielding cDNA. This cDNA is then used as template in a qPCR reaction. The design and choice of primers determine which sequence is amplified and therefore quantified. Generally, qPCR assays can be used to calculate the absolute number of targets in a sample when a set of serial standard dilutions is used to infer a standard curve. However, RT-qPCR assays are mostly performed to compare the abundance of certain transcripts between two or more samples. By an appropriate choice of samples, we can evaluate the influence of different stimuli, cultivation time, developmental state, and many other factors on the transcript abundance of the chosen genes of interest.
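The standard-curve approach mentioned above can be sketched as follows. This is a minimal illustration, not a procedure used in this study: the dilution series and Ct values are hypothetical, and the function names are chosen for this example only. The sketch fits Ct against log10 of the input quantity, derives the amplification factor per cycle from the slope, and reads an unknown quantity off the curve.

```python
import math

def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def amplification_factor(slope):
    """Fold increase per cycle; 2.0 corresponds to perfect doubling."""
    return 10 ** (-1.0 / slope)

def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the input quantity."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical tenfold dilution series (copies per reaction) and Ct values.
standards = [1e6, 1e5, 1e4, 1e3]
standard_cts = [15.0, 18.32, 21.64, 24.96]

slope, intercept = fit_standard_curve(standards, standard_cts)
print(f"slope = {slope:.2f}, amplification factor = {amplification_factor(slope):.3f}")
print(f"estimated copies at Ct 20.0: {quantity_from_ct(20.0, slope, intercept):.0f}")
```

A slope close to -3.32 cycles per tenfold dilution corresponds to an amplification factor close to 2, i.e., near-perfect doubling in each cycle.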

Currently, different mathematical methods are used to calculate the relative transcript levels and compare different samples. The commonly used “Delta–delta method” was first described in the Applied Biosystems User Bulletin No. 2 (P/N 4303859). It presumes an identical and perfect efficiency of target and reference genes. In contrast, the model of Pfaffl takes different efficiencies into account [21]. The ‘efficiency calibrated mathematical method for the relative expression ratio in real-time PCR’ was partially published in an internal magazine of Roche (Biochemica No. 2 2001). According to Pfaffl [21], the Roche model is mathematically hard to follow but delivers results identical to those of Pfaffl’s method. The standard curve-based method of Larionov et al. does not need to take efficiencies into account, as it is based on dilution standards [22]. Notably, all these calculation methods rely on a reference value to compare different samples. The current state of the art is the use of internal reference values, i.e., reference genes. These are genes with constant transcript levels in all samples. Therefore, their abundance in a sample is a representation of the sample itself: the more reference gene transcript is present, the more RNA was isolated and successfully reverse transcribed. Normalization to reference genes aims to mitigate or even nullify differences in the relative abundance of the different types of RNA (rRNA, tRNA, mRNA, etc.) or in the efficiency of the reverse transcription among the different samples.
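To make the difference between the two most common calculation models concrete, the following minimal sketch computes a relative expression ratio once with the Delta–delta method (amplification factor fixed at 2 for both genes) and once with the efficiency-corrected ratio of Pfaffl [21]. All Ct values and efficiencies are hypothetical, and the function and parameter names are chosen for this illustration only.

```python
def ratio_delta_delta_ct(ct_target_control, ct_target_sample,
                         ct_ref_control, ct_ref_sample):
    """Delta-delta method: assumes perfect doubling for target and reference."""
    delta_control = ct_target_control - ct_ref_control
    delta_sample = ct_target_sample - ct_ref_sample
    return 2.0 ** -(delta_sample - delta_control)

def ratio_pfaffl(ct_target_control, ct_target_sample,
                 ct_ref_control, ct_ref_sample,
                 e_target=2.0, e_ref=2.0):
    """Efficiency-corrected ratio; e_* are amplification factors per cycle."""
    target_term = e_target ** (ct_target_control - ct_target_sample)
    ref_term = e_ref ** (ct_ref_control - ct_ref_sample)
    return target_term / ref_term

# Hypothetical Ct values: the target appears 2 cycles earlier in the sample,
# while the reference gene is unchanged.
cts = dict(ct_target_control=25.0, ct_target_sample=23.0,
           ct_ref_control=20.0, ct_ref_sample=20.0)

print(ratio_delta_delta_ct(**cts))          # both models agree when E = 2
print(ratio_pfaffl(**cts))
print(ratio_pfaffl(**cts, e_target=1.9))    # lower target efficiency, lower ratio
```

With identical, perfect efficiencies both functions return the same ratio; the results diverge as soon as the assumed efficiency of the target or reference assay deviates from 2.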

Given this central role in the calculation of relative transcript levels, the choice of reference genes is of utmost importance and must be carefully validated. This is stressed in the widely accepted ‘Minimum Information for Publication of Quantitative Real-Time PCR Experiments’ (MIQE) guidelines [23] and was previously discussed and demonstrated in other fungi such as S. cerevisiae [24], Aspergillus spp. [25, 26], and Trichoderma reesei [27]. The MIQE guidelines further state that at least two reference genes should be used for accurate normalization [23]. In K. phaffii, ACT1 (encoding the cytoskeletal monomer actin) is often used as reference gene in RT-qPCR assays [9, 10, 12, 14,15,16,17,18,19,20]. To our knowledge, this gene is used in the K. phaffii community out of habit and due to the lack of validated reference genes.

In this study, we assessed several publicly available RNA-Seq data sets and ranked all annotated genes according to their variation among the samples. We manually curated this list and chose eight stably expressed genes. To test and validate these genes and ACT1, we compiled a test set of 35 independent samples. These samples cover a broad range of typical and atypical cultivation conditions and stresses and originate from three different K. phaffii strains. The transcript levels of the eight potential reference genes and ACT1 were measured in all samples and evaluated with routinely used tools (i.e., the comparative Delta Ct method [28], BestKeeper [29], NormFinder [30], geNorm [31], and RefFinder [32]).

Results

Identification of stably expressed genes by the assessment of RNA-Seq data

To identify potential reference genes in K. phaffii, we searched for stably expressed genes in publicly available RNA-Seq data (Table 1). The data sets used cover a variety of culturing conditions and sampling time points of the strain GS115. We compared each of the data sets “Glycerol”, “Methanol”, “Glycerol and Methanol” and “YNBE” to “Glucose” and rigorously filtered for genes with stable expression in each of those four comparisons (robust base mean coverage and low log2 Fold Change; details in the Materials and methods section) (Additional file 1: Table S1). Next, we narrowed the gene list down to those genes that appeared in every comparison, giving us a final set of 38 genes (Additional file 2: Table S2).

Table 1 RNA-Seq datasets used in this study

Comparison and evaluation of reference genes

Next, we manually picked eight genes to be experimentally tested for stable expression in K. phaffii (Table 2). We decided to exclude genes without homologs in S. cerevisiae and genes without a gene description. Further, we omitted all genes whose products are potentially involved in or influenced by mechanisms of protein expression or secretion (e.g., ER-residing proteins). Hence, we focused on genes whose products are involved in nuclear processes (e.g., ribosome synthesis, epigenetics, transcription). We also included the currently often used reference gene ACT1 in our evaluation experiment, although ACT1 was not identified as stably expressed (Additional file 2: Table S2).

Table 2 Potential reference genes to be empirically evaluated

To test the applicability of the potential reference genes, we compiled a sample set from three K. phaffii strains covering a broad range of cultivation conditions, such as different carbon sources, cultivation methods, cultivation stages and stress types (Table 3). The samples were designed to represent typical cultivation conditions in K. phaffii experiments. Thus, we reason that genes found to be stably expressed in these samples might be suitable reference genes for future transcript analyses. Notably, no biological replicates of the cultivation conditions were performed, as we did not aim to determine the exact transcript levels and performance of the genes in each sample. Instead, our goal was to create a sample set with a large diversity to generate data on the overall applicability and robustness of the genes.

Table 3 Samples used for reference gene evaluation

The samples were subjected to RT-qPCR analyses measuring the transcript levels of the potential reference genes. The obtained Ct values (Additional file 3: Table S3) were entered into the RefFinder [32] online tool at http://blooge.cn/RefFinder/. This tool integrates four commonly used tools, namely the comparative Delta Ct method [28], BestKeeper [29], NormFinder [30] and geNorm [31], and calculates a comprehensive value from the four individual results (Additional file 3: Table S3 and Fig. 1). Out of the nine tested genes, RSC1 and TAF10 had the lowest variability and thus the most stable transcript levels. In contrast, ACT1 had a substantially higher “comprehensive gene stability” value (Fig. 1 and Additional file 3: Table S3) and was thus less stably expressed in the tested samples.
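As an illustration of how such stability values arise, the following minimal sketch implements the idea behind the comparative Delta Ct method [28]: for every pair of genes, the Ct difference is calculated in each sample, and a gene is considered stable when these differences vary little across samples. The gene names and Ct values are hypothetical and are not the data of this study. The second function assumes that the comprehensive value is a geometric mean of the ranks a gene receives from the individual tools, which is how RefFinder is commonly described; this is an assumption of the sketch and not a re-implementation of the online tool.

```python
import statistics
from itertools import combinations

def delta_ct_stability(ct_by_gene):
    """Mean standard deviation of pairwise Ct differences; lower is more stable.

    `ct_by_gene` maps gene name -> list of Ct values, one per sample,
    with all lists in the same sample order.
    """
    pairwise_sds = {gene: [] for gene in ct_by_gene}
    for gene_a, gene_b in combinations(ct_by_gene, 2):
        diffs = [a - b for a, b in zip(ct_by_gene[gene_a], ct_by_gene[gene_b])]
        sd = statistics.stdev(diffs)
        pairwise_sds[gene_a].append(sd)
        pairwise_sds[gene_b].append(sd)
    return {gene: statistics.mean(sds) for gene, sds in pairwise_sds.items()}

def comprehensive_rank(ranks):
    """Assumed integration step: geometric mean of the per-tool ranks."""
    return statistics.geometric_mean(ranks)

# Hypothetical Ct values of three genes in four samples.
ct_values = {
    "GENE_A": [20.0, 20.1, 19.9, 20.0],
    "GENE_B": [22.0, 22.1, 21.9, 22.0],
    "GENE_C": [18.0, 19.5, 17.0, 20.0],
}

scores = delta_ct_stability(ct_values)
for gene, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{gene}: {score:.3f}")

# A gene ranked 1st, 1st, 2nd and 1st by four tools.
print(f"{comprehensive_rank([1, 1, 2, 1]):.3f}")
```

In this toy data set, GENE_A and GENE_B move in parallel across the samples and therefore receive low scores, whereas GENE_C fluctuates independently and is ranked last.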

Fig. 1

Comprehensive gene stability of the tested genes. This value is a dimensionless integration of the stability values calculated by the comparative Delta Ct method [28], BestKeeper [29], NormFinder [30], geNorm [31], and RefFinder [32] and ranks the tested genes accordingly (see also Table A3, Additional file 3: Table S3). Lower values represent higher stability in the tested samples (Table 3)

Discussion

The choice of a reference gene is a crucial part of a transcript analysis via RT-qPCR. The abundance of a reference gene’s transcript should represent the overall amount of isolated RNA and reverse transcribed cDNA. Consequently, using genes with fluctuating transcript levels is detrimental and prevents a reliable transcript analysis. In this study, we filtered publicly available RNA-Seq data sets from K. phaffii GS115 for genes with a low overall log2 Fold Change in different data set comparisons and manually revised the gene list to obtain potential reference genes. We then tested and evaluated the applicability of these genes experimentally by creating a diverse set of samples from three K. phaffii strains under a broad range of cultivation conditions. The transcript levels of the genes were then compared and evaluated using five broadly used tools for reference gene evaluation.

The obtained results demonstrate that the often-used ACT1 gene is not as stably expressed as other genes (Fig. 1) in our tested samples (Table 3). The stability value of ACT1 (calculated with geNORM finder) of 0.656 lies above the suggested cut-off of 0.5 according to [36]. Further, the average standard deviation of ACT1 (calculated with BestKeeper), at 1.029, lies just above the suggested cut-off of 1.0 according to [29]. This is in accordance with previous findings in S. cerevisiae [24], where ACT1 was also shown to be unsuitable as a reference gene. Consequently, we strongly discourage the use of this gene as a reference gene in future RT-qPCR assays in K. phaffii.

The transcript levels of the genes RSC1 and TAF10 exhibited generally low log2 Fold Changes in the comparison of the RNA-Seq data (Additional file 1: Table S1) and had the lowest “comprehensive gene stability” values in our experimental gene evaluation assay. RSC1 is part of the RSC chromatin remodeling complex, while TAF10 is a subunit of both the TFIID complex and the SAGA complex. Both complexes are involved in gene transcription by RNA polymerase II [37]. Based on their biological roles, it comes as no surprise that RSC1 and TAF10 are stably expressed. We consider both genes suitable reference genes and recommend using them simultaneously (according to the MIQE guidelines [23]) for relative transcript analyses in K. phaffii.
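One common way to use two reference genes simultaneously is to normalize against the geometric mean of their relative quantities, as proposed by the authors of geNorm [31]. The following minimal sketch illustrates this under the simplifying assumption of an identical amplification factor for all assays; in that case the geometric mean of the quantities corresponds to the arithmetic mean of the Ct values. The Ct values are hypothetical and the function name is chosen for this illustration only.

```python
def normalized_expression(ct_target, ref_cts, efficiency=2.0):
    """Target quantity relative to the geometric mean of the reference genes.

    Assumes the same amplification factor (`efficiency`) for all assays, so
    that averaging Ct values equals geometric averaging of quantities.
    """
    ref_index = sum(ref_cts) / len(ref_cts)
    return efficiency ** (ref_index - ct_target)

# Hypothetical Ct values of a target gene and two reference genes
# (e.g., RSC1 and TAF10) in a control and a treated sample.
control = normalized_expression(ct_target=24.0, ref_cts=[22.0, 23.0])
treated = normalized_expression(ct_target=22.0, ref_cts=[22.2, 22.8])

print(f"relative transcript level (treated vs. control): {treated / control:.2f}")
```

Averaging over two reference genes dampens the effect of a small, opposite drift in either gene, as in the hypothetical treated sample above.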

Materials and methods

Assessment of RNA-Seq data

To search for genes with stable transcript levels, we used 11 publicly available RNA-Seq data sets from K. phaffii GS115. These data sets were derived from three studies and encompass 9 different sampling conditions (Table 1). The RNA-Seq data sets were downloaded from the Sequence Read Archive (SRA) [38] of the National Center for Biotechnology Information (NCBI), with a total of 67.429 G bases and 20.597 Gb. We grouped the different data sets into five sample sets based on the added carbon source (“Glucose”, “Glycerol”, “Glycerol and Methanol”, “Methanol”, “YNBE”) according to the description in the original studies (Table 1). The raw reads were inspected using FastQC v0.11.5 and then quality trimmed using Trimmomatic [39]. We extracted a reference transcriptome from the reference genome of K. phaffii GS115 [41, 42] using gffread v0.12.7 [40]. Next, we used salmon 1.4.0 [43] to create a salmon index on the reference transcriptome and quantified each of the data sets, including the --gcBias flag to account for sample-specific biases such as fragment-level GC bias. The quantification results were imported into the R environment and analyzed with the DESeq2 package [44], the tximport package [45], and the Bioconductor package [46].

Next, we compared the following sample sets using DESeq2: glycerol vs. glucose, glycerol and methanol vs. glucose, methanol vs. glucose and YNBE vs. glucose. For each comparison, we calculated the shrunken log fold changes (LFC) of the expressed genes. These were then filtered for genes with a base mean coverage of > 500 and < 3000, to avoid false positives and potential sequencing artifacts, and a log2 Fold Change of < 0.05 and > -0.05, to screen for genes with low changes of the transcript levels within the different data sets (same carbon source, different time points). The obtained gene lists were then compared and filtered for genes appearing in every comparison to obtain a list of genes with stable transcript levels in different carbon sources compared to “Glucose”.
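The filtering and intersection steps described above can be sketched as follows. The analysis itself was carried out in R with DESeq2; this minimal Python illustration only mirrors the logic, using the thresholds stated in the text. The gene identifiers, the base mean and log2 Fold Change values, and the function names are hypothetical.

```python
def stable_genes(results, min_base_mean=500.0, max_base_mean=3000.0,
                 max_abs_lfc=0.05):
    """Return genes inside the coverage window and below the |log2FC| limit.

    `results` maps gene ID -> (base mean, shrunken log2 fold change).
    """
    return {
        gene
        for gene, (base_mean, lfc) in results.items()
        if min_base_mean < base_mean < max_base_mean and abs(lfc) < max_abs_lfc
    }

def stable_in_all(comparisons):
    """Keep only the genes that pass the filter in every comparison."""
    gene_sets = [stable_genes(results) for results in comparisons.values()]
    return set.intersection(*gene_sets)

# Hypothetical results for two of the four comparisons against "Glucose".
comparisons = {
    "glycerol_vs_glucose": {
        "GENE_1": (1200.0, 0.01),
        "GENE_2": (800.0, 0.20),    # fails: fold change too large
        "GENE_3": (2500.0, -0.03),
    },
    "methanol_vs_glucose": {
        "GENE_1": (1200.0, -0.02),
        "GENE_2": (800.0, 0.01),
        "GENE_3": (5000.0, 0.00),   # fails: base mean outside the window
    },
}

print(sorted(stable_in_all(comparisons)))
```

Only genes that pass both thresholds in every comparison remain in the final list.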

Fungal strains and cultivation conditions

The K. phaffii strains CBS 7435, GS115 and BSYBG11 (constitutively expressing a recombinant protein) were pre-cultivated in Yeast nitrogen base media (YNBM) for 20 h at 30 °C and 230 rpm in a shaking incubator (Multitron, Infors HT, Basel, Switzerland). The YNBM consisted of potassium phosphate buffer (pH 6.0), 0.1 M; Yeast Nitrogen Base w/o Amino acids and Ammonium Sulfate (BD Difco, Difco Laboratories Incorporated, part of BD, Franklin Lakes, NJ, USA), 13.4 g/L; (NH4)2SO4, 10 g/L; biotin, 400 mg/L; glucose, 20 g/L [47]. For GS115, histidine was added to the cultivation to a final concentration of 20 mg/L. Once the substrate in the initial culture was completely consumed (determined at-line via Cedex measurements, as described previously [48]), the cultures were used as inoculum (representing 10% of the final volume) for further cultivations. The pre-cultures were used to inoculate shake flasks containing specific media and conditions (Table 3, samples 2–29). Samples were taken after 1 h or 8 h. Sample 1 was taken from the pre-culture of CBS 7435 directly after glycerol depletion, whereas sample 30 was taken 8 h after glycerol depletion from the pre-culture of BSYBG11.

The low salt lysogenic broth (LB) medium consisted of tryptone, 10.0 g/L; NaCl, 5.0 g/L; yeast extract, 5.0 g/L. The BMGY medium consisted of yeast extract, 10.0 g/L; peptone, 20.0 g/L; potassium phosphate buffer (pH 6.0), 0.1 M; Yeast Nitrogen Base w/o Amino acids and Ammonium Sulfate (Difco), 13.4 g/L; biotin, 400 mg/L; glycerol, 10.0 g/L. The DSMZ medium consisted of yeast extract, 3.0 g/L; malt extract, 3.0 g/L; peptone from soybeans, 5.0 g/L; glucose, 10.0 g/L. The basal salt medium (BSM) consisted of 85% (v/v) phosphoric acid, 26.7 mL/L; CaSO4*2H2O, 1.17 g/L; K2SO4, 18.2 g/L; MgSO4*7H2O, 14.9 g/L; KOH, 4.13 g/L; glycerol, 20 g/L, supplemented with trace elements [49].

The strain BSYBG11 constitutively expressing a recombinant protein was also cultivated in a Minifors 2 bioreactor system (max. working volume: 2 L; Infors HT, Bottmingen, Switzerland). The cultivation offgas flow was analyzed online using offgas sensors (IR for CO2 and ZrO2-based for O2; Blue Sens Gas analytics, Herten, Germany). Process control and feeding were performed using EVE software (Infors HT, Bottmingen, Switzerland). The pH was monitored using an EasyFerm Plus pH sensor (Hamilton, Reno, NV, USA). During the cultivations, the pH was kept constant at 5.0 and was controlled with base addition only (12.5% NH4OH), while acid (10% H3PO4) was added manually, if necessary. The temperature was kept constant at 30 °C. Aeration was carried out using a mixture of pressurized air and pure oxygen at 2 vvm to keep the dissolved oxygen (dO2) above 30% at all times. The dissolved oxygen was monitored using a Visiferm DO fluorescence dissolved oxygen electrode (Hamilton, Reno, NV, USA). At the end of the batch phase, indicated by a sudden drop in the CO2 signal and a parallel increase in dO2, a methanol pulse (0.5% v/v, supplied with basal media trace element stock solution) was added to the bioreactors for metabolism adaptation (Table 3, samples 32–33). The fed-batch cultivation was started 24 h after the adaptation pulse and lasted 42 h. A mixed feed (80 g/L methanol mixed with 400 g/L glycerol) was used for samples 32 and 33 (Table 3), while for samples 34 and 35 (Table 3) a derepressed feeding strategy was applied by setting the feeding rate to a limiting level.

RNA extraction

Approx. 0.1 g of yeast cells were resuspended in 1 ml RNAzol RT (Sigma-Aldrich) and lysed using a Fast-Prep-24 (MP Biomedicals, Santa Ana, CA, USA) with 0.5 g of glass beads (1 mm diameter) twice at 6 m/s for 30 s. Samples were incubated at room temperature for 5 min and then centrifuged at 12,000 × g for 5 min. 650 µl of the supernatant were mixed with 650 µl ethanol, and the RNA was isolated using the Direct-zol RNA Miniprep Kit (Zymo Research, Irvine, CA, USA) according to the manufacturer’s instructions. This kit includes a DNase treatment step. The concentration and purity were measured using a NanoDrop ONE (Thermo Fisher Scientific, Waltham, MA, USA).

RT-qPCR assays

500 ng of isolated total RNA was reverse transcribed using the LunaScript RT SuperMix (NEB) according to the manufacturer’s instructions. The resulting cDNA was diluted 1:50 and 2 µl were used as template in a 15 µl reaction using the Luna Universal qPCR Master Mix (NEB) according to the manufacturer’s instructions. The primers used are listed in Additional file 4: Table S4. All reactions were performed in technical duplicates on a Rotor-Gene Q system (Qiagen, Hilden, Germany).