Introduction

Reference genes have been routinely used in gene expression analyses in traditional cancer studies1,2. Although reference genes are valued because their expression is expected not to change under different physiological and experimental conditions3,4, numerous reports have cautioned against the blind use of routinely used reference genes5,6. Furthermore, a groundbreaking analysis of RNA-seq data criticized the indiscriminate use of common reference genes7.

Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is a common reference gene in relative RT-qPCR experiments8. GAPDH was initially introduced as a suitable reference gene mainly due to its role in glycolysis; however, it is also involved in a variety of nuclear events such as transcription, RNA transport, DNA replication, apoptosis, nuclear translocation of proteins, and DNA repair9,10,11,12,13. The functional roles of GAPDH are not limited to cytoplasmic glycolysis, and more roles in the mitochondria and cytoskeleton have recently been discovered14. As a result, further investigation of GAPDH is required to determine its suitability for relative RT-qPCR data normalization. In this regard, we previously reported that SYMPK is a promising substitute reference gene among eight common reference genes, which also include B2M, TBP, ACTB, HPRT1, PYCR1, GUSB and GAPDH15. To summarize, SYMPK had the lowest CqCV%, was recommended by the BestKeeper software in both normal and PTC tissues (r = 0.958 and 0.969, respectively), and the SYMPK/ACTB pair had the lowest stability value (0.209) according to the NormFinder algorithm. Finally, in addition to its statistical advantages, the SYMPK gene was proposed for normalizing RT-qPCR data because it lacks pseudogenes.

The target gene specificity and the sex-dependent behavior of reference genes were factors not previously considered in cancer studies. Fortunately, massive amounts of gene expression data are publicly available, allowing the selection of appropriate reference genes for any cancer study. As a result, we expanded on our laboratory findings in this study by conducting a precise and comprehensive bioinformatics meta-analysis. In our study population, routinely used reference genes were assessed in thyroid neoplasm subtypes in two scenarios: one that included patient sex consideration and the other that did not. GAPDH was not an appropriate reference gene in papillary thyroid cancer (PTC) tissues, as evidenced by our bioinformatics and lab-based experiments, because its expression was dependent on tumor subtypes.

We propose a novel approach for future cancer research: each target gene must have its own reference gene(s). Using TCGA and the NCBI Gene Expression Omnibus (GEO), we created two gene lists: one for TCGA-PTC (with over 25,000 genes) and one for all thyroid neoplasm subtypes (GEO, with more than 6000 genes). To select reference genes accurately, we developed an equation based on the means and standard deviations of the expression values of target and reference genes.

Results

The workflow in Fig. 1 summarizes wet and dry lab procedures, including all laboratory experiments and in-silico analyses on datasets.

Figure 1

Workflow for performing bioinformatics analyses and laboratory-based investigations.

Wet (laboratory) research

Quality and quantity of RNA

The mean absorbance ratios of wavelengths 260/280 and 260/230 were 1.96 ± 0.11 and 1.97 ± 0.06 for PTC tissues and their normal tissues, respectively. The intensity of 28S-rRNA bands was 1.5–2-times that of 18S-rRNA, indicating that the integrity of all extracted RNAs was satisfactory.

Target genes expression patterns

The expression levels of three target genes, NKX2-1 (Gene ID: 7080), RTRAF (Gene ID: 51637), and ETS1 (Gene ID: 2113), were compared between PTC and adjacent normal tissues. To generalize the findings, these three target genes are hereafter referred to as A, B, and C; the focus here is not on the identity of the genes but on their perplexing expression patterns after normalization with different reference genes. The expression of the target genes was normalized separately against the commonly used reference gene, GAPDH, and against our recently proposed SYMPK (Fig. 2). When normalized against GAPDH or SYMPK, gene A showed contradictory results in 13 out of 17 PTC samples, whereas only 4 samples (PTC samples 2, 7, 16, and 17) did not. In PTC sample 1, for example, gene A showed a negative delta-delta Cq ratio when normalized against GAPDH (red bar) but a positive delta-delta Cq ratio when normalized against SYMPK (blue bar). The same holds true for gene A in PTC samples 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, and 15. Thus, simply replacing GAPDH with SYMPK produced an inconsistent result for gene A in 76.5% of samples. Discrepancies were also observed when gene B (Fig. 2B, 52.9% of samples) and gene C (Fig. 2C, 29.4% of samples) were normalized against GAPDH and SYMPK. Therefore, when a specific PTC tissue was compared to its adjacent normal tissue, the same target gene could be reported as either overexpressed or downregulated, depending on the reference gene used.
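The percentages quoted above are simply the share of samples whose delta-delta Cq changes sign when the reference gene is swapped. The following is a minimal sketch of that count, not the authors' analysis script; the function name and the example values are hypothetical, and a delta-delta Cq of exactly zero is treated as non-positive by assumption.

```python
from typing import Sequence


def discordance_rate(ddcq_ref_a: Sequence[float], ddcq_ref_b: Sequence[float]) -> float:
    """Percentage of samples whose ddCq sign differs between two normalizations."""
    if len(ddcq_ref_a) != len(ddcq_ref_b) or len(ddcq_ref_a) == 0:
        raise ValueError("both inputs must be non-empty and of equal length")
    # A sample is discordant when one normalization gives a positive ddCq
    # and the other does not (zero is counted as non-positive here).
    discordant = sum(1 for a, b in zip(ddcq_ref_a, ddcq_ref_b) if (a > 0) != (b > 0))
    return 100.0 * discordant / len(ddcq_ref_a)


# Hypothetical ddCq values for four samples (NOT the study's data).
against_gapdh = [-1.2, 0.8, -0.5, 2.1]
against_sympk = [0.9, 0.6, 1.4, 1.7]
print(discordance_rate(against_gapdh, against_sympk))  # 50.0

# The reported rates follow from the sample counts given in the text.
for gene, n_discordant in (("A", 13), ("B", 9), ("C", 5)):
    print(gene, round(100 * n_discordant / 17, 1))
```

With the counts reported in the text (13, 9, and 5 of 17 samples), the same arithmetic gives 76.5%, 52.9%, and 29.4%.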

Figure 2

PTC tissues were compared to their adjacent normal tissues. Each target gene (gene A, gene B and gene C) was normalized once against GAPDH (red bars) and once against SYMPK (blue bars). For gene A, 4 out of 17 samples (samples 2, 7, 16, and 17) show the same pattern after normalization against the two reference genes, while 13 samples (e.g. samples 1, 5, 6 and 10) show contradiction. For gene B, 9 out of 17 samples, and for gene C, 5 out of 17 samples show contradiction. Positive and negative delta-delta Cq ratios represent a target gene that is down-regulated or over-expressed in PTC tissues, respectively. The y-axis presents delta-delta Cq ratios and the x-axis shows PTC samples.

SYMPK and GAPDH expression in normal and PTC tissues

The best way to normalize RT-qPCR data is to pick a reference gene or genes that exhibit the least amount of variation in mRNA expression across all of the samples. SYMPK (Fig. 3, blue bars) showed a narrower range of Cq values than GAPDH (Fig. 3, red bars) in normal tissues, and the same was true in PTC tissues, where SYMPK had less variance.

Figure 3

Expression of SYMPK and GAPDH in normal and PTC tissues. In normal tissues, SYMPK (blue bars) exhibited a smaller range of Cq values than GAPDH (red bars), and this was also true in PTC tissues, where SYMPK had less variance.

Statistics on reference genes and target genes

Statistical analyses of target genes (A, B, and C) and reference genes (GAPDH and SYMPK) are presented in Table 1. SYMPK exhibited a lower SD = 1.74 and CqCV% = 5.84 in both the adjacent normal and the PTC tissues than GAPDH (SD = 4.26 and CqCV% = 16.32). To determine the differences in the expression values of GAPDH and SYMPK in the adjacent normal tissues as well as the PTC tissues, separate tissue statistics have been provided. In adjacent normal tissues, the mean Cq values of gene A (29.00) and gene B (28.74) were close to the mean Cq value of SYMPK (29.96) but far from the corresponding value of GAPDH (26.53). The same pattern was observed in PTC tissues, where the mean Cq value of SYMPK (29.59) was similar to that of genes A (28.44) and B (29.28), but not to GAPDH (25.69). In contrast to genes A and B, the mean expression of gene C in both adjacent normal tissues (26.24 and 26.53, respectively) and PTC tissue (25.8 and 25.69, respectively) was close to that of GAPDH. The SYMPK gene had a lower difference in expression between normal and PTC tissues (Cq = 29.96 and 29.59, respectively), whereas the GAPDH gene had a wider range of Cq values between normal and PTC tissues (Cq = 26.53 and 25.69, respectively). GAPDH gene expression, on the other hand, varied significantly more (3.85 < SD < 4.59) than target genes (2.65 < SD < 3.71). SYMPK, which had the lowest SD and CqCV% values in adjacent normal tissues, PTC tissues, and both tissues, was a better reference gene than GAPDH.

Table 1 Statistics for laboratory-collected RT-qPCR data.

Dry (bioinformatics) research

Inter-subtype comparisons

Fourteen microarray datasets with expression and phenotype data (Supplementary Table S2) were downloaded and cleaned (Materials and Methods). Because FVPTC (follicular variants of PTC) is the most common variant of PTC, FVPTC and PTC samples were analyzed as a single phenotypic group. For 6331 genes held in common, 520 samples were compiled, including 116 normal, 38 FTA (follicular thyroid adenoma), 246 PTC, 39 FTC (follicular thyroid carcinoma), 27 PDTC (poorly differentiated thyroid carcinoma), 52 ATC (anaplastic thyroid carcinoma), and 2 MTC (medullary thyroid carcinoma). Microarray probes were matched to corresponding genes, mean expression values for a probe set were calculated for each gene, and the data were subjected to “removeBatchEffect” (Supplementary Figs. S1 and S2).

The expression levels of eight common reference genes were compared in two ways: between normal tissues and each subtype of thyroid cancer, as well as between subtypes (Table 2). GAPDH and SYMPK had effect sizes (ES) of 0.235 and 0.151 for the PTC subtype, respectively, when compared to normal tissues; however, only the ES of GAPDH was statistically significant (p = 0.0020). GAPDH had statistically significant ES values in both the FTC (p = 0.0012) and ATC (p = 3.19E−17) subtypes. Furthermore, GAPDH had higher ES values than SYMPK in ATC (0.652 vs 0.070, respectively) and FTC (0.389 vs 0.154, respectively) subtypes. Other subtypes, such as FTA, PDTC, and MTC, showed negligible differences between GAPDH and SYMPK expression. GUSB (− 0.024), ACTB (0.032), and HPRT1 (0.037) were the three most suitable reference genes in the PTC subtype, with the lowest non-significant ES (Fig. 4A,B). The best three reference genes for other subtypes were SYMPK (0.070), TBP (− 0.076) and GUSB (− 0.098) in ATC (Fig. 4C,D); ACTB (0.036), HPRT1 (0.063), and GUSB (0.064) in FTC (Fig. 4E,F); GUSB (0.023), HPRT1 (− 0.052), and TBP (− 0.062) in FTA (Fig. 4G,H); GUSB (0.062), HPRT1 (− 0.075), and PYCR1 (− 0.102) in PDTC (Fig. 4I,J); ACTB (0.015), B2M (0.027), and TBP (− 0.070) in MTC (Fig. 4K,L).

Table 2 Analyses of differential expression between normal tissues and thyroid cancer subtypes, as well as inter-subtype comparisons.
Figure 4

Volcano plots of differentially expressed genes and selected reference genes in each subtype in a microarray inter-subtype meta-analysis. (A) all genes and (B) selected reference genes of PTC versus normal analysis. (C) all genes and (D) selected reference genes of ATC versus normal analysis. (E) all genes and (F) selected reference genes of FTC versus normal analysis. (G) all genes and (H) selected reference genes of FTA versus normal analysis. (I) all genes and (J) selected reference genes of PDTC versus normal analysis. (K) all genes and (L) selected reference genes of MTC versus normal analysis.

The inter-subtype analysis was divided into two parts: the first assessed the differential expression of reference genes between the undifferentiated (ATC) subtype and all other subtypes, and the second assessed the differential expression of reference genes between the poorly differentiated (PDTC) subtype and the differentiated subtypes (FTA, PTC, FTC, MTC). GAPDH had statistically significant differential expression between ATC and all other subtypes, with the exception of FTC (0.262) and MTC (0.354). When undifferentiated ATC tissues were compared to differentiated PTC tissues, the genes GAPDH, ACTB, B2M, HPRT1, and PYCR1 were found to be significantly differentially expressed. The same results were obtained when comparing undifferentiated ATC tissues to poorly differentiated PDTC tissues. A gene expression analysis was also performed to compare PDTC to the differentiated subtypes, and none of the reference genes showed statistically significant differences. As a result, only the comparison of PDTC with FTA is reported in Table 2 and the others were omitted.

Intra-sex analyses, as well as sex-subtype interactions

Intra-sex analysis was performed to determine the differentially expressed reference genes in each of the two sexes, and the interaction of sex and subtype was investigated using factorial designs (Table 3). We analyzed 253 samples, including 44 normal, 15 FTA, 119 PTC, 24 FTC, 27 PDTC, and 24 ATC, after 6 out of 14 datasets failed to offer detailed information regarding the sex of the patients. We did not have any FTA-male samples, and no MTC subtype samples were left. Most of the reference genes did not reveal statistically significant differences in expression in intra-sex analysis. The only exceptions were ATC-women, who had statistically significant differential expression of the B2M (ES = 0.536, p = 0.0175) and PYCR1 (ES = 0.900, p = 0.0290) genes. The ES value of GAPDH was higher in females than in males in the PTC subtype (ES = 0.222 vs ES = 0.028, respectively), but the difference was not statistically significant (ES.Female − ES.Male = 0.194, p = 1), according to the interaction analysis. There were also differences in the expression of some other reference genes between females and males (e.g. TBP in ATC and B2M or GUSB in FTC), but using a factorial design to calculate the differences in differential expression revealed no significant differences in the expression of these genes (p = 1 and 1 or 0.4560, respectively).

Table 3 Intra-sex analyses, as well as sex-subtype interaction.

The ES values of reference genes are depicted for females and males according to subtype (Fig. 5-1 and 5-2, respectively). ACTB was the best reference gene in women with PTC (Fig. 5-1A,B) and FTA (Fig. 5-1I,J) subtypes, while B2M was the best in FTC-women (Fig. 5-1C,D), PYCR1 was the best in PDTC-women (Fig. 5-1E,F), and TBP was the best in ATC-women (Fig. 5-1G,H). In males with PTC (Fig. 5-2A,B), PDTC (Fig. 5-2E,F), and FTC (Fig. 5-2C,D), HPRT1 was the best reference gene, while SYMPK was the best in males with ATC (Fig. 5-2G,H).

Figure 5

(1) Volcano plots of differentially expressed genes and selected reference genes in female samples in a microarray meta-analysis. (A) all genes and (B) selected reference genes of PTC versus normal analysis. (C) all genes and (D) selected reference genes of FTC versus normal analysis. (E) all genes and (F) selected reference genes of PDTC versus normal analysis. (G) all genes and (H) selected reference genes of ATC versus normal analysis. (I) all genes and (J) selected reference genes of FTA versus normal analysis. (2) Volcano plots of differentially expressed genes and selected reference genes in male samples in a microarray meta-analysis. (A) all genes and (B) selected reference genes of PTC versus normal analysis. (C) all genes and (D) selected reference genes of FTC versus normal analysis. (E) all genes and (F) selected reference genes of PDTC versus normal analysis. (G) all genes and (H) selected reference genes of ATC versus normal analysis.

Intra-subtype, inter-sex analysis

With the exception of FTA, inter-sex analysis was performed within subtypes to determine the most appropriate reference gene in different pathological conditions (normal and subtypes, Table 4). TBP, PYCR1, and B2M were the best reference genes in normal tissues (Fig. 6A,B), while ACTB, TBP, and HPRT1 were the best ones in the PTC subtype (Fig. 6C,D). HPRT1, SYMPK, and TBP were the best genes for the FTC subtype (Fig. 6E,F), HPRT1, ACTB and GAPDH for the PDTC subtype (Fig. 6G,H), and B2M, GUSB, and ACTB for the ATC subtype (Fig. 6I,J).

Table 4 Combined analysis of intra-subtype and inter-sex microarray data.
Figure 6

Volcano plots of differentially expressed genes and selected reference genes in a microarray meta-analysis based on intra-subtype and inter-sex analysis. (A) all genes and (B) selected reference genes of normal female versus normal male analysis. (C) all genes and (D) selected reference genes of PTC female versus PTC male analysis. (E) all genes and (F) selected reference genes of FTC female versus FTC male analysis. (G) all genes and (H) selected reference genes of PDTC female versus PDTC male analysis. (I) all genes and (J) selected reference genes of ATC female versus ATC male analysis.

Microarray and RNA-seq data statistics

The TCGA database was used to download raw expression counts of 560 samples, including 502 PTC and 58 normal tissues, and the statistics of these RNA-seq data are shown in Table 5. ACTB (2.89), GAPDH (3.08), and SYMPK (3.25) were the top three genes in PTC tissues with the lowest CV% values. In normal tissues adjacent to PTC tissues, SYMPK (CV% = 2.84) was ranked after GAPDH (2.46) and GUSB (2.59). According to the differential expression of the reference genes (Table 6), the top three genes with the lowest ES values were ACTB (− 0.001), TBP (− 0.017), and SYMPK (0.034), respectively. GAPDH had the highest ES value (0.06) among the eight reference genes (Fig. 7).

Table 5 TCGA dataset statistics for eight selected reference genes.
Table 6 TCGA differential expression analysis in PTC samples.
Figure 7

TCGA volcano plots of differentially expressed genes and selected reference genes in the PTC subtype. (A) all genes and (B) selected reference genes of PTC versus normal analysis.

Table 7 shows statistics for microarray pooled data from adjacent normal tissues and each thyroid cancer subtype. While GAPDH was ranked fifth (3.37), the genes with the lowest CV% values in normal tissues were GUSB (2.77), B2M (2.86), and SYMPK (3.10), respectively. GUSB (2.48), GAPDH (2.56), and ACTB (2.86) had the lowest CV% values in PTC tissues, followed by SYMPK (3.38).

Table 7 GEO microarray dataset statistics for eight selected reference genes.

To facilitate use, basic statistics for all 6331 genes in the GEO dataset (Supplementary Table S3) and all 25,705 genes in the TCGA dataset (Supplementary Table S4) are provided. These two tables allow the mean and standard deviation values of prospective target genes to be compared with the statistics of candidate reference genes.

Discussion

In research and clinical detection, RT-qPCR is the gold-standard method for expression evaluation16,17,18. The advantages of RT-qPCR include high sensitivity and specificity, speed of analysis, and real-time monitoring of results8. Nature protocols require that appropriate internal reference gene(s), formerly known as housekeeping genes, be validated prior to each study19,20. Historically, an ideal reference gene has minimally altered expression under various pathological and physiological conditions such as tumour type and patient sex. It must be free of pseudogene(s) and alternative splicing15. We previously investigated eight reference genes and discovered that SYMPK was more stably expressed than conventional reference genes (GAPDH and ACTB) and also lacked pseudogenes15. Ribosomal RNA (18S rRNA) is a highly recommended reference gene for RT-qPCR data normalization21,22. Unfortunately, 18S rRNA has at least three drawbacks: inhibition by mitomycin C23, absence in bulk high-throughput expression platforms, and a clear role in cancer development24,25,26,27,28 and prognosis29. We did not include 18S rRNA in our study due to the aforementioned facts and a previous report about its unstable expression30.

GAPDH and SYMPK were used as reference genes to normalize three candidate genes to better understand the consequences of using inappropriate reference genes. GAPDH was chosen because it is the most commonly used reference gene in molecular biology, and we previously reported it as the worst reference gene using the NormFinder algorithm15. This is in line with a previous study that found GAPDH to be unsuitable for normalizing relative RT-qPCR data from bladder and colon cancer31. The gene did not meet the criteria of those authors (e.g. tissue stability, expression level above background, and lack of alternative splicing), so it was eventually ignored despite being ranked in colon cancer.

In this study, the expression of reference genes (GAPDH and SYMPK) was compared between normal and PTC tissues; SYMPK was found to be a better reference than GAPDH because it had less variability. Aside from the lack of alternative splicing, lower CqCV% values for the SYMPK gene were obtained from relative RT-qPCR data in both normal and PTC tissues. The main point of contention is that GAPDH had a significantly higher SD than the target genes, a flaw that makes it decidedly inappropriate for mRNA expression normalization. We performed a meta-analysis on GEO microarray data combined with a comprehensive TCGA RNA-seq data analysis to increase the sample size, include all thyroid cancer subtypes, and involve both sexes. We discovered that GAPDH was significantly upregulated in PTC, FTC, and ATC, and as a result, the gene is unsuitable as a reference gene according to the microarray meta-analysis. GAPDH was found to be significantly upregulated at various stages of tumor differentiation. This finding suggests that GAPDH may be a key promoter of tumor aggressiveness, as previously reported by Chiche et al. in non-Hodgkin’s B lymphomas32. They proposed that the increased GAPDH levels activated the nuclear factor-κB gene, which in turn increased the activity of hypoxia-inducible factor-1α (HIF-1α). In this study, when FTA and ATC subtypes were compared, the expression of HIF-1α was also upregulated (ES = 0.497, p = 0.0001).

We provided separate tables to assist researchers in accurately selecting reference genes for their study designs. For example, if researchers want to study different subtypes, Table 2 provides a list of genes, and the gene with an ES closest to zero is the best fit for their research. Researchers could use Table 3 to include the sex of patients in an analysis, and the best genes are those with ES.Female − ES.Male closest to zero. Table 4 is the best reference when both a specific subtype and the sex of the patients are required, with genes with ES values closest to zero serving as the best reference genes.
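The selection rule described here (discard candidates whose ES is statistically significant, then prefer the ES closest to zero) can be sketched in a few lines. The ES values below are those quoted in the Results for the PTC-versus-normal comparison; the significance flags are our reading of the text (only GAPDH is stated to be significant), and the class and function names are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    gene: str
    effect_size: float
    significant: bool  # True if the corrected p-value is below 0.05


def rank_reference_genes(candidates: List[Candidate]) -> List[Candidate]:
    """Drop significant candidates, then sort the rest by |ES| (smallest first)."""
    eligible = [c for c in candidates if not c.significant]
    return sorted(eligible, key=lambda c: abs(c.effect_size))


# PTC versus normal, ES values as quoted in the text (microarray meta-analysis).
ptc_candidates = [
    Candidate("GAPDH", 0.235, True),
    Candidate("SYMPK", 0.151, False),
    Candidate("GUSB", -0.024, False),
    Candidate("ACTB", 0.032, False),
    Candidate("HPRT1", 0.037, False),
]

ranked = rank_reference_genes(ptc_candidates)
print([c.gene for c in ranked])  # ['GUSB', 'ACTB', 'HPRT1', 'SYMPK']
```

The same function could be applied to the ES.Female − ES.Male column of Table 3 or to the ES column of Table 4, since in each case the criterion is closeness to zero.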

Furthermore, a discrepancy was discovered when each target gene was normalized against two different reference genes, SYMPK and GAPDH (Fig. 2 and Table 8). We hypothesized that the difference occurred because of the overlap between the Cq values of target and reference genes. By overlapping, we mean that the Cq values of the reference and target genes are within the same range, and thus samples with positive ddCq mutually neutralize samples with negative ddCq, resulting in a change in the overall expression pattern of a target gene (Supplementary Fig. S3). To resolve this issue, researchers should use Eq. (1) when selecting reference genes.

$$ \text{abs}\left( \mu_{\text{T}} - \mu_{\text{R}} \right) \ge 3\sigma_{\text{T}} + 2\sigma_{\text{R}} $$
(1)

abs: absolute value, µ: mean, σ (SD): standard deviation, T: Target gene expression in each subtype, R: Reference gene expression in each subtype.

Table 8 Single PTC sample analysis using ddCq method.

Consider the case where a reference gene has no variation in its expression (σ = 0) and a target gene has σ = 1. If the difference in mean expression between the target and the reference is at least three times the absolute value of the target gene's SD (3σT), the reference gene does not overlap with the target gene (Supplementary Fig. S4a). However, a reference gene with an SD value of 0.25 necessitates a difference of at least 3.5 units between the reference and target genes' mean expression values (Supplementary Fig. S4b). By doubling the SD value of the reference gene (from 0.25 to 0.5 and from 0.5 to 1), the mean expression values of the reference and the target genes must differ by 4 (Supplementary Fig. S4c) and 5 units (Supplementary Fig. S4d) respectively. To avoid overlap, we found that twice the absolute value of the SD of the reference gene (2σR) must also be considered for the calculation of the difference between the mean expression of the reference and the target genes. Therefore, it is possible to avoid overlaps between the expression values of reference gene and target gene and stop contradictory gene expression patterns by using Eq. (1).
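The worked cases in this paragraph can be reproduced directly from Eq. (1). The sketch below is only an illustration of the inequality, with hypothetical function names; it is not part of the published analysis.

```python
def min_mean_gap(sd_target: float, sd_reference: float) -> float:
    """Right-hand side of Eq. (1): 3 * SD(target) + 2 * SD(reference)."""
    return 3.0 * abs(sd_target) + 2.0 * abs(sd_reference)


def satisfies_eq1(mean_target: float, mean_reference: float,
                  sd_target: float, sd_reference: float) -> bool:
    """True when the target and reference means are far enough apart to avoid overlap."""
    return abs(mean_target - mean_reference) >= min_mean_gap(sd_target, sd_reference)


# Target SD fixed at 1, reference SD increased as in the text.
for sd_ref in (0.0, 0.25, 0.5, 1.0):
    print(sd_ref, min_mean_gap(1.0, sd_ref))
# 0.0 3.0
# 0.25 3.5
# 0.5 4.0
# 1.0 5.0

# Hypothetical means: a gap of 4 units passes when SD(reference) = 0.5 but not when it is 1.
print(satisfies_eq1(30.0, 26.0, 1.0, 0.5))  # True
print(satisfies_eq1(30.0, 26.0, 1.0, 1.0))  # False
```

In practice, the means and SDs would be taken from the same subtype for both genes, as the symbol definitions for Eq. (1) specify.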

For all expressed genes in GEO and TCGA, we provided tables with basic statistics, such as mean and SD (Supplementary Tables S3, S4). Our expression data could be a reliable estimate of any population for researchers to compare the mean and SD of desired genes in the above equation because our analyses include large sample sizes representing multiple ethnicities and subtypes in both sexes.

In conclusion, selecting reference gene(s) solely on the basis of specific tissues may result in inaccurate or misleading information. We questioned the common practice of selecting traditional reference genes. In a comprehensive investigation of thyroid cancer subtypes, we discovered that GAPDH expression was significantly influenced by the aggressiveness of thyroid tumor subtypes. We created a new equation to help researchers choose the best reference gene(s) based on their desired target genes.

Materials and methods

Ethics statement

All patients who had PTC prior to surgery were given thorough explanations about sampling procedures, anonymous data publication, and rights of the subjects. All participants signed written informed consent forms. Tissues from patients who declined to participate were not included in the study. This study was approved by the Isfahan University ethical committee's institutional review board (IR.UI.REC.1398.058). All experiments and procedures in this study, including but not limited to those involving human participants, were carried out in accordance with the 1964 Helsinki Declaration and its subsequent amendments or comparable ethical standards.

Human tissue acquisition

Seventeen PTC tissues and their adjacent normal tissues were taken from patients undergoing total or partial thyroidectomy at Al-Zahra and Sina hospitals in Isfahan, Iran. Approximately 50 mg of freshly dissected PTC tissues and adjacent normal tissues were immediately submerged in 1 ml RNAlater RNA Stabilization Reagent (Qiagen, Hilden, Germany) and incubated at 4 °C for 24 h per the manufacturer’s instructions. Tissue samples were then briefly centrifuged to remove any residual RNAlater before being stored at − 80 °C for further analysis. The hospital or third-party laboratories performed postoperative histopathological analyses and pathological approval. Pathological staging was reported using the American Joint Committee on Cancer Tumor-Node-Metastasis (TNM) staging system, 7th edition.

RNA extraction and assessment

Total RNA was extracted from RNAlater-treated samples using a one-step RNA extraction reagent (Bio Basic, Markham, ON, Canada), as directed by the manufacturer. The concentration of isolated RNA was determined using a NanoDrop OneC spectrophotometer (Thermo Scientific, Waltham, MA, USA). A260/A280 and A260/A230 ratios were used to determine RNA purity. The integrity of the RNA was determined using 1.0% agarose gel electrophoresis.

Complementary DNA (cDNA) synthesis

DNase I treatment (Thermo Scientific, Bremen, Germany) was used to remove residual genomic DNA contamination, as directed by the manufacturer. One microgram of total RNA was reverse transcribed in a total reaction volume of 20 μL using the Thermo Scientific RevertAid Reverse Transcriptase kit (Thermo Scientific, Bremen, Germany) according to the manufacturer’s instructions.

Design of exon-junction primers

To avoid amplifying genomic DNA and/or heterogeneous nuclear RNA, all primers were designed to span exon-exon junctions. Beacon Designer 8.1 (Premier Biosoft International, Palo Alto, CA, USA) was used to design primers that span specific exons. Oligo 7 (Molecular Biology Insights, Colorado Springs, CO, USA) was used to recheck the primers for any unwanted secondary structure. The NCBI Primer-BLAST service was used to confirm the specificity of the designed primers. The melting temperature of the primers was validated using temperature gradient PCR (Sinaclon Bioscience, Tehran, Iran). All of the information on the primer pairs is presented in Supplementary Table S1.

Relative RT-qPCR

In a Bio-Rad Chromo4 device (Bio-Rad, Hercules, CA, USA), a relative RT-qPCR reaction was performed using SYBR Green RealQ Plus 2 × Master Mix (Ampliqon, Odense, Denmark). The RT-qPCR reaction protocol consisted of (i) one cycle of enzyme activation and initial denaturation at 95 °C for 15 min, and (ii) 40 cycles of denaturation at 95 °C for 30 s, annealing for 30 s, and extension at 72 °C for 30 s. After each cycle, the plates were read. All relative RT-qPCR reactions were run in triplicate, with non-template control (NTC) per gene.

Melt curve analysis

To assess the specificity of relative RT-qPCR, a melt curve was constructed by raising the temperature from 55 to 95 °C in 1 °C increments, followed by plate reading. The derivative of fluorescence change over temperature (y-axis) was plotted against the temperature (°C, x-axis).

Gene expression analysis

Cq values were exported from the Bio-Rad Chromo4 thermocycler into Microsoft Excel (2013) for further analysis. The average of Cq values for reference and target genes in PTC tissues and adjacent normal tissues was calculated and the Livak method was used for normalization2. The delta Cq value of each sample was calculated by subtracting the Cq value of the reference gene from the Cq value of the target gene, and delta-delta Cq was determined as the difference between the delta Cq of each PTC tissue and the average delta Cq of the adjacent normal tissues.
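The calculation described above can be sketched as follows. This is a minimal illustration rather than the authors' spreadsheet: it assumes delta Cq = Cq(target) − Cq(reference), which is consistent with the Fig. 2 legend (a positive delta-delta Cq indicating down-regulation), and it adds the standard Livak fold change, 2^(−ddCq). All function names and Cq values are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

CqPair = Tuple[float, float]  # (Cq of target gene, Cq of reference gene) for one sample


def delta_cq(cq_target: float, cq_reference: float) -> float:
    """Delta Cq of one sample (assumed convention: target minus reference)."""
    return cq_target - cq_reference


def delta_delta_cq(tumor: List[CqPair], normal: List[CqPair]) -> List[float]:
    """ddCq of each tumor sample relative to the mean delta Cq of the normal tissues."""
    baseline = mean([delta_cq(t, r) for t, r in normal])
    return [delta_cq(t, r) - baseline for t, r in tumor]


def fold_change(ddcq: float) -> float:
    """Livak relative expression: 2 ** (-ddCq)."""
    return 2.0 ** (-ddcq)


# Hypothetical Cq values (NOT the study's data).
normal_samples = [(29.0, 27.0), (30.0, 26.0)]  # delta Cq = 2.0 and 4.0, mean = 3.0
tumor_samples = [(28.0, 27.0), (31.0, 26.0)]   # delta Cq = 1.0 and 5.0

ddcq_values = delta_delta_cq(tumor_samples, normal_samples)
print(ddcq_values)                            # [-2.0, 2.0]
print([fold_change(v) for v in ddcq_values])  # [4.0, 0.25]
```

Under this convention, the first hypothetical tumor sample would be read as over-expressed (negative ddCq, fold change above 1) and the second as down-regulated.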

Statistical analysis

Microsoft Excel 2013 (Microsoft, Redmond, WA, USA) was used to calculate qPCR fold change, maximum Cq, minimum Cq, standard deviation (SD), mean Cq, and coefficient of variation (CqCV%, CqCV% = SD/mean × 100). CV% is a statistical measure that represents the relative dispersion of gene expression values in a dataset, regardless of the mean expression values of the genes. It is used to avoid the problem of interpreting SD without considering the overall expression level.
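As a short illustration of the CqCV% formula, the sketch below computes it from raw Cq values and, separately, from a reported SD and mean. The function names are hypothetical. Two assumptions are made: the sample standard deviation (n − 1 denominator) is used, and the pooled mean Cq is approximated as the average of the normal and PTC means reported in the Results, which is reasonable only because both groups contain 17 samples.

```python
from statistics import mean, stdev
from typing import Sequence


def cq_cv_percent(cq_values: Sequence[float]) -> float:
    """CqCV% = SD / mean * 100, using the sample SD (an assumption)."""
    return 100.0 * stdev(cq_values) / mean(cq_values)


def cv_percent_from_summary(sd: float, mean_cq: float) -> float:
    """CqCV% from an already computed SD and mean."""
    return 100.0 * sd / mean_cq


# Hypothetical Cq values: mean = 22, sample SD = 2.
print(round(cq_cv_percent([20.0, 22.0, 24.0]), 2))  # 9.09

# Summary statistics quoted in the Results (pooled mean approximated as described above).
sympk_cv = cv_percent_from_summary(1.74, (29.96 + 29.59) / 2)
gapdh_cv = cv_percent_from_summary(4.26, (26.53 + 25.69) / 2)
print(round(sympk_cv, 2), round(gapdh_cv, 2))  # 5.84 16.32
```

Both values agree with the CqCV% figures reported for SYMPK and GAPDH in the Results.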

Data collection

The GEO and The Cancer Genome Atlas (TCGA) databases were used to obtain microarray and RNA-seq data, respectively. To retrieve microarray expression data related to thyroid neoplasms, the GEO database was searched for the keywords “thyroid neoplasm”, “thyroid cancer”, and “thyroid carcinoma”. Exclusion criteria were applied, and any data from species other than Homo sapiens were discarded. Cell lines, treatments, therapies, knocked-in and knocked-out models, and any dataset with incomplete phenotype information were excluded from further analysis. To reduce other biases, samples were collected from different countries and from people of various ethnicities. To compensate for the small sample size in different sexes and pathological subtypes, pooled data analyses were performed. As a result, 14 microarray datasets containing 520 samples were used in this study. FTA, PTC, FVPTC, FTC, MTC, PDTC, and ATC were among the thyroid neoplasms represented in the datasets. Microarray datasets are described in detail in Supplementary Table S2.

Pooled data analysis and calculation of effect size

Although the protocols for microarray and RNA-seq analyses differed, the first step was to perform single dataset quality controls. Box plots were used to validate the log2 transformation and quantile normalization. Outlier detection was accomplished through the use of hierarchical clustering based on the Pearson correlation coefficient (PCC) as well as principal component analysis (PCA). The expression data from the outlier-removed datasets were compiled, the batch effect was removed with the Limma package's “removeBatchEffect” command, and a PCA plot was generated. The Limma package was used to analyze the pooled data, and the effect size (ES) was calculated. The family-wise error rate (FWER) “bonferroni” method was used to correct P-values. An effect size with FWER < 0.05 was deemed significant. The best reference genes had the lowest ES and a non-significant p-value. For inter-subtype analysis (subtypes-normal, undifferentiated-differentiated, poorly differentiated-differentiated), intra-sex analysis (subtypes-normal, separately in females and males), and intra-subtype/inter-sex analysis (females-males, separately in each subtype), two-group models were built. Interaction analysis was also performed between males and females, and a factorial design was used to estimate the impact of the individuals' sex at various levels of the cancer subtypes ((female.tumor − female.normal) − (male.tumor − male.normal)).
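Two of the steps above reduce to simple arithmetic and can be illustrated without the Limma package: the Bonferroni correction and the factorial interaction contrast. The sketch below is only an illustration under stated assumptions; it does not reproduce Limma's moderated statistics, the function names and p-values are hypothetical, and the only study values used are the GAPDH effect sizes in PTC quoted in the Results.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni correction: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def interaction_contrast(female_tumor: float, female_normal: float,
                         male_tumor: float, male_normal: float) -> float:
    """(female.tumor - female.normal) - (male.tumor - male.normal) on mean log2 expression."""
    return (female_tumor - female_normal) - (male_tumor - male_normal)


# Hypothetical raw p-values for three tests.
adjusted = bonferroni([0.01, 0.04, 0.5])
print([round(p, 2) for p in adjusted])  # [0.03, 0.12, 1.0]

# Hypothetical group means chosen so that the within-sex effects equal the GAPDH
# effect sizes quoted for PTC (0.222 in females, 0.028 in males).
print(round(interaction_contrast(10.222, 10.0, 10.028, 10.0), 3))  # 0.194
```

The cap at 1 in the correction is one plausible reason why several interaction p-values in the Results are reported as exactly 1.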

The edgeR package was used to calculate logFC and FWER-corrected P-values from TCGA raw read counts. GAPDH and SYMPK were two of the eight reference genes, with the remaining six being GUSB, ACTB, B2M, TBP, PYCR1, and HPRT1. For all the analyses, the software platform R 4.0.1 (R Foundation for Statistical Computing, Austria, 2020) was used.

GEO and TCGA datasets statistics

Using the RStudio environment, maximum, minimum, SD, mean, and CV% were calculated from the expression values of the selected genes in both the microarray pooled data and the TCGA data. After compiling the expression data for each cancer subtype separately, these statistics were calculated for each row, with each row representing one gene. A total of 6331 genes from the microarray pooled data analysis output and 25,705 genes from the TCGA analysis output were statistically analyzed.