Background

During the last decade, remarkable advances have been achieved in clinical medicine by the application of DNA microarray technology. Molecular signatures relevant to the diagnosis, prognosis and therapy have been discovered for numerous diseases [1,2,3]. In recent years, RNA-sequencing (RNA-seq) has been recognized as an attractive alternate technology for the same purpose. Compared to microarrays, RNA-seq provides a more comprehensive profiling of the transcriptome, with better quantitation, over a wider dynamic range, while allowing single base resolution, and detection of isoforms, RNA editing events, microRNAs and long noncoding RNAs [4,5,6]. The technology has been refined sufficiently to allow mRNA profiling of single cells. Challenges among the application of RNA-seq in clinical medicine include the need for an experimental design that includes sufficient numbers of biologic and technical replicates, and implementation of a mathematically valid bioinformatics pipeline to mine the large volume of data generated at a reasonable cost [7, 8].

The application of RNA-seq to the allograft kidney is at a very rudimentary stage. Rigorous definition of housekeeping genes (HKG) is essential for further progress in this field. HKG can be defined as genes ubiquitously expressed in all tissue compartments and cell-types regardless of their developmental stage, physiological condition and exposure to external stimuli [9]. HKG used in traditional clinical studies and classical biology experiments include 18S RNA, 28S RNA, tubulins, beta-actins, and glyceraldehyde-3-phosphate dehydrogenase (GAPDH). However, it is known that the expression of these genes is not constant through the cell cycle, and is further altered in response to injurious stimuli. Indeed 18 s RNA Is one of the biomarkers associated with acute rejection [10]. Actin is upregulated in chronic allograft dysfunction [11]. Tubulin is targeted by Colchicine, a drug used in patients with gout: it inhibits microtubule polymerization by binding to tubulin and block mitosis by acting as a ‘spindle poison’ [12]. These examples illustrate how use of single genes or a limited set HKG can be inherently problematic.

One potential solution to the problem is to use bioinformatics techniques and derive large HKG data sets for evaluation of high throughput gene expression data. This will ensure that alteration of a small number of genes due to experimental conditions does not unfavorably affect the overall data analysis. Accordingly, this study has developed HKG gene sets appropriate for assessment of differential gene expression using nine different nine normalization methods that include library size, total counts (TC), upper quartile (UQ), Median, Quantile, trimmed mean of M -values (TMM), reads per kilobase million (RPKM), transcripts per kilobase million (TPM) and DESeq. HKG lists are offered that are specific to particular normalization paradigms. In addition, there is a universal set of 42 housekeeping transcripts that are common to all nine individual analyses.

Methods

Clinical material

This study was approved by the University of Pittsburgh IRB (protocol # 10110393). Formalin fixed paraffin embedded renal allograft biopsies (n = 25) were derived from recipients diagnosed with acute tubular injury (ATI; n = 5), T cell-mediated rejection (TCMR; n = 5), interstitial fibrosis and tubular atrophy (IFTA; n = 5), and BK virus-associated nephropathy (BKVN; n = 5), as well as recipients with stable allograft function (STA; n = 5). Five native kidney biopsies with interstitial nephritis (ISN; n = 5) were also studied. The clinical and pathology parameters pertinent to these specimens have been published previously [13].

RNA sequencing

RNA was extracted from 1 cubic mm pieces of formalin fixed paraffin embedded biopsy tissue using the Invitrogen PureLink™ FFPE RNA Isolation Kit (Catalog number: K156002), which includes a melting buffer to remove paraffin and a Proteinase K digestion step. cDNA libraries were constructed from 100 ng total RNA obtained using the Ion Ampliseq Transcriptome Human Gene Expression Kit from Life Technologies (Cat# A26325) and the manufacturers recommended protocol. Ampliseq Transcriptome analysis was performed by PrimBio Research Institute LLC, Exton, PA, USA, using an Ion Proton sequencer Ion Proton P1 chips, IonXpress barcodes, and Torrrent_Suite 5.0.4 software according to the manufacturer’s instructions. Briefly, Library Amp Primers were employed to amplify the purified cDNA library by PCR, and the yield and size of distribution of each library was run on Agilent 2100 Bioanalyzer. Approximately 100 pM of pooled barcoded libraries were used for templating using the Life Technologies Ion Chef Kit. Raw sequence files (fastq) were aligned to the human transcriptome (hg19) reference sequences in StrandNGS software. Gene and transcript annotations were retrieved from the Ensembl database to generate aligned SAM files, which were filtered on read quality (> 15), alignment score (≥90), match count (≤1) and mapping quality (≥25). RNA-seq quality control data on these biopsies has been published has been published [14]. RNA purity assessed by the A260/A280 ratio ranged from 1.87 to 2.0. RNA fragments of greater than 200 nucleotides in length comprised greater than 30% of the total RNA concentration. The mean sequence length in this RNA-Seq data set ranged from 66 to 117 nucleotides. Greater than 98.5% of the reads aligned to the human transcriptome with accuracy rates of greater than 97%. Our data has been submitted to the GEO database (GSE120495).

Definition of HKG/normalization methods

The term HKG has been conceived to refer to genes responsible for maintenance of fundamental cellular function. These genes are ubiquitously expressed at approximately the same level in all cell-types regardless of developmental stage, physiological condition and presence of external stimuli [9, 15]. In this study, genes with expression coefficients of variance (CV) corresponding to the 2nd percentile across all 30 samples were assigned to the HKG category as has been suggested in the literature [16]. In a dataset of 47,613 genes this corresponded to 952 genes with the lowest CV. CV was calculated as the ratio of the standard deviation (SD) σ to the arithmetical mean μ of each gene.

Comparison of RNA-seq expression values across multiple samples requires normalization of data. Several normalization algorithms have been described in the literature, and we explored nine different methods, namely, library size, TC, UQ, Median, Quantile, RPKM, TPM, TMM and DESeq. Briefly, library size refers to the number of reads that aligned to the human genome. TC refers to transcript counts that remained after removing genes with an expression value of zero in all samples. The UQ scaling factor was calculated as a ratio of the 75th percentile of counts for each sample divided by the mean 75th percentile in all 30 samples [17]. The median scaling factor was obtained in the same manner using the 50th percentile [18]. Quantile normalization was implemented in R software using the “normalizeQuantiles()” function in the EBSeq package (Bioconductor version 3.6). This method sorts the test and reference distributions and proceeds to assign the highest value in latter to the highest value in the former [19]. The RPKM method attempts to normalize first for sequencing depth (per million reads) and then gene length (expressed in kilobases) [20]. TPM normalization proceeds in the reverse order: first, the raw read counts are divided by the length of the gene in kilobases, and then divided by the “per million” scaling factor [21]. TMM normalization was performed using the “calcNormFactors()” function in the edgeR package. The TMM method calculates a scaling factor based on a weighted trimmed mean of log gene expression ratios based on the assumption that most genes are not differentially expressed. Weights are assigned to account for the fact that genes with larger RNA-seq counts have lower variance, and data from both the upper and lower ends are trimmed prior to deriving a scaling factor for the sample library size [22]. Finally, DESeq normalization was implemented in DESeq package by calling the “estimateSizeFactors()” and “sizeFactors()” functions, which are also based on the hypothesis that most genes in the RNA-seq are not differentially expressed [23]. The performance of different normalization methods on our dataset was compared by calculating the bias and variance of genes in each HKG set [24]. The following formulae were used for the calculation of bias and variance, respectively:

$$ {\mathrm{Bias}}_{\mathrm{i}}=\kern0.5em \sqrt{\frac{1}{n}\kern0.5em {\Sigma}_{j=1}^n\kern1em {\left({\log}_2\left(\frac{ Ki j}{\overline{Ki.}}\right)\right)}^2} $$
$$ {\mathrm{Variance}}_{\mathrm{i}}=\frac{1}{n-1}{\sum}_{j=1}^n{\left({\log}_2\left(\frac{ Ki j}{\overline{Ki.}}\right)-{\log}_2{\left(\frac{ Ki j}{\overline{Ki.}}\right)}_i\right)}^2 $$

In these formulae, the Kij represents the normalized read counts for ith gene from the jth sample, where the \( \overline{Ki.} \) is the mean value of normalized read counts of each gene across 30 samples.

Validation of HKG using published datasets

It was reasoned that genes classified HKG in this study would have minimal representation in lists of genes known to be differentially expressed in disease states that affect the kidney. Accordingly, we sought overlaps between the HKG dataset, and published gene sets derived from biopsy with T-cell mediated rejection, antibody mediated rejection, polyomavirus nephropathy, and chronic allograft damage [25,26,27,28]. Probe sets used to define disease associated genes in these studies were extracted from the NCBI GEO (Gene Expression Omnibus) database, and the corresponding gene and transcript annotations were obtained from the Ensembl database. Overlaps between gene lists of interest were defined by the “Compare” tool available in IPA® (Ingenuity Pathway Analysis) software (QIAGEN Biotechnology, Venlo, Netherlands). IPA core analysis was used to define the top-ranked canonical pathways and molecular functions associated with HKGs. A flow diagram of the steps used to identify and validate HKG in this study is presented as Fig. 1.

Fig. 1
figure 1

Flow diagram of the steps used to identify and validate HKG genes in this study

Results

Identification of housekeeping genes

The mean number of reads with a quality score > Q30 obtained from the 30 biopsies ranged from 19 to 28 million, and yielded a total of 57,738 distinct reads that aligned to the hg19 human reference genome. After removing genes with an extracted expression value of zero in all biopsies, 47,613 transcripts remained for further consideration. Nine different HKG sets were created, one for each normalization method. Individual HKG expression accounted for only a small percentage of the total transcription activity in the samples. This is suggested by our calculation of expression ratios that represent mean normalized transcript counts of individual genes expressed as a proportion of the maximal transcript read count in the entire sample set. The numerical value of these expression ratios was less than < 0.05% for > 70% of the HKGs. (Table 1). The median coefficient of variation associated with most normalization methods was comparable (~ 0.3) except for the RPKM and TC methods where it was substantially higher (0.66 & 0.43 respectively) (Fig. 2a). The bias and variance of gene expression measurements was also the highest for these same two normalization methods (Table 1) indicating that the other methods tested by us provide much better data normalization. Similar results were obtained if CVs were calculated for the 42 HKG common to all normalization methods (Fig. 2b).

Table 1 Summary of HKG Datasets Defined in This Study Using 9 Different Normalization Methods
Fig. 2
figure 2

Box plots showing the median, first quartile, third quartile, and range of CV (coefficient of variance) for all 952 HKG defined by nine different normalization algorithms (a) and for the subset of 42 HKG common to all nine normalization methods (b) . a The median values (range) of CV in 952 HKGs defined by RPKM and TC are 0.67 (0.65–0.69) and 0.44 (0.41–0.45), respectively; whereas the mean values of CV defined by UQ, Median, Quantile, TMM, DESeq, TPM and Library size are 0.31 (0.28–0.33), 0.29 (0.27–0.31), 0.29 (0.26–0.31), 0.31 (0.29–0.33), 0.30 (0.27–0.32), 0.29 (0.27–0.31), 0.29 (0.26–0.31), respectively. b The median values (range) of CV in 42 common HKGs defined by RPKM and TC are 0.67 (0.65–0.69) and 0.43 (0.42–0.45), respectively; whereas the mean values of CV defined by UQ, Median, Quantile, TMM, DESeq, TPM and Library size are 0.28 (0.26–0.31), 0.25 (0.23–0.28), 0.25 (0.22–0.29), 0.26 (0.24–0.30), 0.25 (0.22–0.29), 0.25 (0.23–0.28), 0.25 (0.22–0.28), respectively. TC: total counts; UQ: upper quartile; TMM: trimmed mean of M-values; TPM: transcripts per kilobase million; RPKM: reads per kilobase per million mapped reads; e (see Materials and Methods section for details)

Validation of housekeeping genes

By definition, HKG maintain basic cellular functions and their expression does should not change in different disease states. This prediction was verified using public datasets for 4 common pathologic conditions in the kidney, namely, TCMR, antibody mediated rejection (ABMR), BKVN and chronic allograft injury manifesting as IFTA (Table 2). None of the 952 genes identified as HKG in this study were differentially expressed in human allograft biopsies with TCMR. For the gene lists associated with the remaining biopsy-diagnoses, an overlap of no more than 3 genes was seen with our HKG lists. This is remarkable since Gene Expression Omnibus data used in these comparisons was derived from more than 1000 biopsies.

Table 2 Overlapsa Between Gene Expression Datasets Derived from Diseased Allograft Kidney & HKG Defined in This Study

As an alternate approach to validating the HKG datasets obtained in this study, we compared the constituent genes with HKG lists defined by other investigators using varied technical approaches including expressed sequence tags, DNA microarray, RNA-seq, and massively parallel signature sequencing (Table 3) [29,30,31,32,33,34,35]. HKG derived from sequencing based technologies gave the largest number of genes (279 to 656) in common with our own RNA-seq derived gene list. There were fewer (80 to 117) genes shared with microarray technology-based lists. It is apparent that HKG gene identification can be affected by both the normalization method used as well as the technology applied to measure gene expression. The type of tissue analyzed is also an important variable. Whereas all our samples represent the allograft kidney, the aforementioned prior studies included multiple organs in their analysis. Thus, different HKG gene sets can be equally valid depending on the clinical setting and sample set being investigated.

Table 3 Comparison of Published housekeeping genes with HKG Datasets Defined in This Study

Pathway analyses

Ingenuity pathway analysis was performed on 42 genes common to 9 HKG sets derived from different normalization methods. The Entrez gene names and molecular functions of these genes are listed in Table 4. The majority are involved in chromatin, core promoter, DNA, mRNA, protein, or ATP binding, while others represent ubiquitous enzymes belonging to the protein kinase, phosphatase, protease, ligase, ATPase or GTPase family. Physiologic functions mediated by these housekeeping genes included regulation of the cell cycle, cell to cell signaling, post-translational modifications, cell morphology, cell movement, molecular transport, and lipid or nucleic acid metabolism (Figs. 3 and 4, Table 5). The top 4 canonical pathways identified all involved de novo or salvage pathways of pyrimidine biosynthesis, including pyrimidine ribonucleotides interconversion, pyrimidine ribonucleotides de novo biosynthesis, and pyrimidine deoxyribonucleotides de novo biosynthesis. Notably, less than 5% of the genes in these pathways met the criterion for being classified as a housekeeping gene. The majority of the remaining canonical pathways were related to protein signaling mediated by the protein kinase A, p38 MAPK, RhoA, CREB, ERK/MAPK, Eif4, p70S6K, IL-12, glucocorticoid receptor, estrogen receptor, or progesterone receptor pathways.

Table 4 Housekeeping Genes (n = 42) Common to All Normalization Methods
Fig. 3
figure 3

Canonical pathways identified by IPA core analysis as over-represented amongst 42 HKG common to 9 different data normalization methods. Pathways meeting statistical confidence thresholds preset in IPA are identified on the Y-axis (−log10 p = 1.3, right-tailed Fisher’s exact test). The lower X-axis and the line diagram display the proportion of total genes in the specified pathway that meet the cutoff criteria for identification

Fig. 4
figure 4

Top 20 physiologic functions associated with 42 HKG common to all biopsies and normalization methods. Physiological functions meeting statistical confidence thresholds (−log10 p = 1.3, right-tailed Fisher’s exact test)

Table 5 Canonical Pathways identified by IPA software for 42 Housekeeping Genes Common to All Normalization Methods

Discussion

The primary purpose of this study was to identify HKG appropriate for analyzing RNA-seq data derived from human renal allograft biopsies. It is expected that RNA-seq technology will be increasingly applied to discover molecular signatures relevant to the diagnosis, prognosis and therapy of diseases that commonly afflict kidney transplant recipients. The work performed has identified 9 HKG sets using different normalization methods and the question arises which gene set is most applicable to the analysis of gene expression data derived from renal allograft biopsies. Zyprych-Walczak et al. [36] analyzed transcripts from mammary epithelial cell lines, B-cells, and blood or bone marrow samples from patients with acute myeloid leukemia. They compared six normalization algorithms with respect to sensitivity, specificity, classification errors, and generation of diagnostic plots, and found that bias and variance were appropriate indices to compare the performance of different normalization methods. Application of this principle to our data indicates normalization using only the TC or RPKM methods is not advisable. The other normalization methods give essentially comparable results, although the quantiles method is nominally better than all the others that were tested. The basic idea behind RPKM is to normalize the reads first by total reads and then by gene length. Previous studies have confirmed the suboptimal performance of this method [17, 37, 38]. Interestingly, better performance was seen with TPM which differs from RPKM only in that normalization for gene length precedes correction for total reads. This reversal in the order of operations led to relatively uniform transcript counts in all 30 biopsies. However, in one prior study both TC and RPKM led to unsatisfactory results [39]. Two prior investigations noted that the quantile method is associated with lower variance in observed gene expression data, but there is a tradeoff that results in the introduction of some bias [19, 40]. Another study reported that DESeq method is the best for the normalization [39].

Our assessment of the published literature is that no single normalization method can be universally recommended for all data sets. HKG lists vary depending on study design, tissues analyzed, sequencing technology, normalization methods, as well as criteria and tools for housekeeping gene selection [41, 42]. Data distribution and analytic plans can influence the choice of normalization method: e.g. if most genes have low expression, upper quantile rather than median normalization should be preferred. On the other hand, if differential expression is to be performed by the DESeq program, the normalization algorithm incorporated in the software can work directly with unnormalized RNA-seq counts. Finally, we suggest that when working with renal allograft biopsies, the problem of choosing the right HKG set can be circumvented by using the list of 42 genes (Table 4) that is common to gene sets derived by 9 different algorithms.

The HKG proposed in this study have been validated with reference to publicly available external gene expression datasets obtained on an independent platform, namely, the Affymetrix DNA microarray analysis system. These latter datasets were derived from kidney transplant biopsies with TCMR, ABMR, BKVN or i-IFTA. A second observation that validates our HKG gene lists is that these share up to 656 genes with other RNA-seq derived gene lists in the literature. Finally, our IPA analysis is consistent with the proposed housekeeping function of these genes, and is concordant with putative cellular and biologic functions of other HKG reported in the literature. These reported functions include RNA processing, RNA splicing, DNA repair and mRNA metabolic processes [43], cell morphology and signaling, defense/apoptosis, ribosomal protein signaling/communication, structure/motility [44, 45], and biogenesis of nucleotides/amino acids and protein localization [35]. It is to be noted that some genes such as GAPHD and beta actin (ACTB), which are widely used in biological experiments as housekeeping controls, do not appear in our HKG set [46,47,48]. Likewise, our gene list does not include 8 genes that have been suggested to be suitable as a reference set for studies of the non-transplanted kidney [49].

Conclusion

In summary, we have developed several different HKG gene lists applicable to RNA-seq data derived from for human allograft kidney biopsies and processed by a variety of normalization methods. We have also assembled a universal set of 42 HKG that can be used without regard to the actual normalization procedure used. The study is limited by the small number of biopsies studied and use of formalin fixed paraffin embedded tissue, which may not be optimal to detect genes expressed at low abundance. However, low abundance genes have high variance and are not good candidates for the HKG designation. Importantly, the general bioinformatics approach that we have outlined is applicable to define HKG for RNA-seq datasets of any size and RNA quality for transplantation of all organs. Appropriate normalization of samples with a comprehensive set of HKG provides a mechanism to correct for batch effects, which can be a significant obstacle in the implementation of RNA-seq as a monitoring tool in the transplant clinic.