Background

DNA microarrays and RNA sequencing (RNA-Seq) have become indispensable experimental tools for characterizing the effects of biological interventions on genome-wide gene expression (“transcriptomics”) [1, 2]. Applications of these tools have been transformative in many areas of biological research, including cancer biology, biomarker discovery, and drug target identification [3,4,5]. These applications often involve differential expression analysis: the isolation of differentially expressed (DE) genes between healthy and disease conditions. Knowledge of DE genes facilitates the discovery of causative genes and gene pathways for a disease of interest. For example, many studies of carcinogenesis focus on identifying the genes directly responsible for promoting cancer occurrence (“driver genes”) out of all DE genes [6] . Furthermore, DE gene identification is an important first step for disease biomarker discovery. The discovery of biomarkers from transcriptomics data typically involves selecting the most discriminative genes between a healthy and diseased state or between different disease states [7] . A comprehensive list of DE genes provides a biologically plausible set of candidates for these discriminative genes and can greatly streamline the search [8]. Common applications of transcriptomics-derived biomarkers include predicting diagnosis, prognosis, and therapeutic response for a disease of interest through a process known as supervised classification [9]. In this context, DE gene identification can be viewed as a means of performing feature selection for classification. In general, feature selection is a process for dimensionality reduction that removes redundant or irrelevant features (genes), reduces classification model complexity, and improves classification performance [10].

Despite their widespread use for DE gene identification, transcriptomics data are notorious for their inclusion of technical and biological noise [11]. This noise complicates differential expression analysis by reducing the accuracy of DE gene identification relative to other assays (e.g., real-time or quantitative PCR [12]), lowering the reproducibility of experiments conducted on different platforms [13], and reducing the statistical power associated with the detection of DE genes at a particular fold change [14]. A straightforward strategy for mitigating the effects of noise is to increase the number of replicates assayed (“sample size”) for each condition of interest. However, this practice can be cost prohibitive or even impossible for conditions with limited sample availability. Furthermore, even with larger sample sizes, transcriptomics data pose a considerable challenge to feature selection methods due to the curse of dimensionality. Specifically, it is well known that optimal fitting of classification models (including the selection of features) breaks down when the feature dimensionality is substantially larger than the sample size [15].

One promising solution for the above challenges is to incorporate prior biological knowledge into differential expression and feature selection analyses [16]. This Bayesian approach can mitigate problems associated with a small sample size [17], while also improving biological interpretability of the resulting DE genes/features [10]. Prior biological knowledge for transcriptomics data can take several forms, including pre-existing transcriptomics data from other studies, data from complementary high-throughput assays (e.g., chromatin immunoprecipitation or protein-protein interactions), and gene functional annotation (e.g., Gene Ontology [18, 19] or KEGG [20, 21] ). For the purposes of this study, we will focus on the first type of knowledge, although we note that analytical methods are available to incorporate the other types as well [22, 23]. Thanks to functional genomics repositories like the Gene Expression Omnibus (GEO)[24, 25] and ArrayExpress [26], transcriptomics data from over 2.5 million samples are publicly available. Furthermore, the size of this resource is growing exponentially, with numbers of samples in GEO doubling every 3–4 years.

Over the last 15 years, a number of methods have been developed that use prior knowledge in the form of transcriptomics data to inform differential expression analyses [27,28,29,30,31]. However, these methods typically either ignore the identities of the many experimental conditions in the pre-existing data, or they do not leverage these identities to create a rigorous statistical metric for identifying DE genes. For example, the SVD Augmented Gene expression Analysis Tool (SAGAT) uses singular value decomposition (SVD) to extract transcriptional modules from pre-existing DNA microarray data [27]. These modules, which contain no information regarding assayed conditions, are then incorporated into a statistical analog of the two-sample t-test to improve the accuracy of DE gene identification. In contrast, a very recent study made direct use of the experimental conditions in pre-existing data to characterize empirical prior probabilities of differential expression [31]. However, although these prior probabilities were predictive of differential expression patterns, they were not explicitly utilized in a Bayesian statistical framework for identifying DE genes. Relatedly, although there have been many studies contributing novel or adapted feature selection methodologies for classification of biomedical data [32,33,34,35,36], to our knowledge no method combines an experimental condition-aware analysis of pre-existing data with a statistically principled means of feature selection.

To address these shortcomings, we propose a novel differential expression and feature selection approach—GEOlimma—that leverages pre-existing GEO-derived transcriptomics data. As described below, our proposed method modifies the popular Linear Models for Microarray and RNA-Seq Data (“Limma”) method [37, 38] . Specifically, GEOlimma incorporates empirical prior probabilities of differential expression (DE prior probabilities) in a Bayesian statistical test for DE genes. We first describe the computation and biological characterization of DE prior probabilities from a large collection of pre-existing DNA microarray experiments from GEO. Next, we apply GEOlimma and Limma to four benchmark differential expression comparisons from two validation datasets. Our results demonstrate a substantial increase in experimental power for identifying DE genes due to use of GEOlimma. Finally, we explore GEOlimma’s ability to improve feature selection for classification across the four benchmark comparisons.

Results

In this study, we developed a gene expression feature selection method, GEOlimma, in which gene-level differential expression (DE) prior probabilities were derived from large-scale microarray data freely available from the Gene Expression Omnibus (GEO). We first explored enriched biological pathways in genes with either high or low DE prior probabilities. We then applied GEOlimma to DE analysis and supervised classification tasks on a collection of four validation datasets.

Biological analysis of DE prior probabilities

The goal of differential expression analysis is to identify differences in gene expression across biological conditions in order to discover functional genes and pathways involved in a biological process of interest. The Limma method [39] is an empirical Bayesian approach for identifying DE genes that has been widely applied. However, an important limitation of this method is that the prior probabilities for differential expression are set to be constant for all genes. This implies that all genes have the same chance of being expressed differently, which is not biologically realistic [31]. Therefore, we developed and applied GEOlimma, which uses a large collection of GEO datasets to compute gene level DE prior probabilities (see “Methods” section). We first downloaded the 602 GEO DataSets (GDS) currently available from the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array), followed by performing pairwise DE analysis among the largest possible collection of non-overlapping sample groups (number of samples \(\ge\) 5) from each GDS experiment. We identified DE genes using a Benjamini–Hochberg false discovery rate (FDR) threshold of 0.05. By repeating this procedure for every GDS, we calculated DE frequencies for 21025 distinct Entrez genes (20,283 genes with unique gene mappings) across all experiments (2481 pairwise comparisons total) and converted these to prior probabilities of DE. Given gene-level DE prior probabilities, we can then compute posterior probabilities of DE for a given biological experiment using Bayes theorem. Figure 1 shows distributions of pairwise comparisons across GDS datasets (A) as well as DE prior probabilities (B), the latter ranging between 0.0048 and 0.1769 with two apparent modes. We note that the largest number of comparisons afforded by a single GDS (171) is less than 7% of the 2481 total comparisons, with the vast majority of GDS datasets providing fewer than 25 comparisons each. The median probability observed is 0.069, which we note is roughly seven times higher than the default constant prior probability used by Limma (0.01). Additional file 1: Fig. S1 lists the top most frequently DE genes, including TUBA1A (tubulin alpha 1a), CD24, and SERPINB1 (serpin family B member 1), with DE prior probabilities of 0.1769, 0.1761, and 0.1693, respectively. The three least frequently DE genes were LOC102725116, TMCO5A (transmembrane and coiled-coil domains 5A), and LINC01492 (long intergenic non-protein coding RNA 1492), with DE prior probabilities of 0.0048, 0.0056, and 0.0060, respectively. Generally speaking, we hypothesize that genes with high prior probabilities of DE are more likely to be implicated in human disease and thus could function as potential biomarkers, while those with low DE prior probabilities represent genes either not expressed under most of the observed conditions or constitutively expressed genes that are required for the maintenance of basic cellular functions (i.e., housekeeping genes).

Fig. 1
figure 1

GEO Dataset comparisons and prior distribution. a Histogram of 2481 pairwise comparisons made within 602 curated GEO Datasets. b Histogram of DE prior probabilities for 20,283 genes made from the pairwise comparisons in A

In order to improve our biological understanding of the calculated DE prior probabilities, we performed gene set enrichment analysis (GSEA) based on KEGG pathways with the top 500 most and least frequently DE genes, respectively. Table 1 lists significantly enriched pathways (BH-adjusted p-value \(\le\) 0.05), which include 19 pathways from the most frequently DE genes and 4 from the least frequently DE genes. The most significant pathway in the former category is hsa04110: Cell cycle (adjusted \(p = 7.83\)E−08); Fig. 2 illustrates the frequently DE genes mapped in this pathway. Two additional pathways in this category directly related to cell growth and death include hsa04115: p53 signaling pathway and hsa04210: Apoptosis. We also identified six cancer-specific frequently DE pathways: hsa05222: Small cell lung cancer, hsa05206: MicroRNAs in cancer, hsa05218: Melanoma, hsa05202: Transcriptional misregulation in cancer, hsa05205: Proteoglycans in cancer, and hsa05220: Chronic myeloid leukemia. Finally, the two frequently DE pathways hsa04068: FoxO signaling pathway and hsa04668: TNF signaling pathway function in Signal transduction. We note that signal transduction pathways are involved in cell death mechanisms that function in colorectal carcinogenesis progression [40].

The 4 least frequently DE pathways include two sensory system pathways: hsa04740: Olfactory transduction and hsa04742: Taste transduction, Signaling molecules and interaction pathway. The other two significant pathways in this category were hsa04080: Neuroactive ligand-receptor interaction and hsa05320: Autoimmune thyroid disease. Our results suggest that genes belonging to these pathways are either not expressed under most circumstances or show relatively stable expression across many biological conditions.

Table 1 KEGG enrichment analysis of top 500 genes with high and low DE prior probabilities
Fig. 2
figure 2

Significantly enriched Cell Cycle pathway from genes with high DE prior probabilities. The red shaded blocks indicate genes with high prior probabilities

GEOLimma method application on four validation datasets

We investigated the utility of gene-specific DE prior probabilities by performing DE analysis with GEOlimma in four evaluation datasets. Specifically, we selected two GEO series—GSE8052 and GSE15061—from platform GPL570 that enabled four DE comparisons to be made. Importantly, neither of these datasets was represented by a GEO GDS, meaning that none of the resulting comparisons were involved in DE prior probability computation. The four comparisons include Asthma vs Non-asthma (GSE8052) and three comparisons from GSE15061: Nonleukemia (Nonleuk) vs Myelodysplastic syndrome (MDS), Nonleuk vs acute myeloid leukemia (AML), and AML vs MDS. The probes for each of the datasets are represented by 20,283 genes with unique mappings. Any genes without available DE prior probabilities were assigned the median value of all prior probabilities. We first identified DE genes using GEOlimma as well as the standard Limma method. This allowed us to compare the two methods, as well as characterize the extent of differential expression present in each comparison. For Limma, we considered genes to be DE if their BH-adjusted p value \(\le\) 0.05. In contrast, as GEOlimma enables the calculation of a modified B score only (see “Methods”), we selected a B score threshold for GEOlimma significance based on the smallest Limma B score for which the Limma adjusted p value \(\le\) 0.05. Using these criteria, we identified DE genes based on all relevant samples for each of the four comparisons described above. To assess the effect of small sample sizes on GEOlimma/Limma performance, we also randomly sampled 10 subsets of 40 samples (20 in each class) for each comparison and calculated the mean and standard deviation of the number of DE genes across these subsets using both methods. Table 2 lists details of each DE comparison along with summaries of our analysis results using both Limma and GEOlimma. We note that for the Asthma comparison, there are no significant DE genes based on all samples (as well as in subsets) using the Limma method. Therefore, we were not able to quantify the number of DE genes for this comparison using GEOlimma. In the remaining three comparisons, our results demonstrate that GEOlimma identifies more DE genes than Limma when applied to either all samples or 40-sample subsets. Figure 3a helps illustrate why this is, by examining the distributions of Limma and GEOlimma B scores for the Asthma comparison. Despite the lack of significant DE genes in this comparison, use of GEOlimma results in a wider B score distribution with a marked shift to higher values compared to Limma. This difference is due to the diverse set of gene-specific DE prior probabilities used by GEOlimma, the median value of which is substantially higher than the constant value used by Limma. The potential increase in numbers of DE genes identified by GEOlimma also suggests that use of a small constant DE prior probability may result in overly conservative DE gene identification. In our PCA and t-SNE visualizations of all samples (Fig. 3b, c), we note the lack of clear separation between the Asthma and Non-asthma groups, which helps explain why no significant DE genes were detected. Figure 3d–f show the same information for a randomly selected subset of 40 samples. We note that the B scores have a similar distribution as that of all samples.

Table 2 Differential expression comparison details and Limma and GEOlimma DE gene counts for all samples and 10 subsets of 40 samples of each comparison
Fig. 3
figure 3

B score change and sample visualizations of asthma dataset. The top figures are generated from all samples; the bottom figures are drawn using a random subset with 40 samples. a and d depict Limma and GEOlimma B score distributions of all genes, b and e show PCA visualizations of samples, and c and f show t-SNE sample visualizations

When looking at the top 20 most significantly DE genes for each comparison, we noted that use of GEOlimma changes the order of these genes compared to Limma, with an overall higher average B score (Additional file 2: Fig. S2). To further explore this phenomenon, we counted the genes in common for the top 100–1000 most significantly DE genes between GEOlimma and Limma across 10 randomly selected 40-sample subsets for each comparison. The average overlap percentages were 67.3% for the Asthma comparison, 87% for Nonleuk vs MDS, while over 95% for both AML vs MDS (95.2%) and Nonleuk vs AML (95.5%) (Additional file 3: Fig. S3). These results suggest that GEOlimma DE prior probabilities have a larger effect on the resulting DE gene list for datasets showing a more modest overall degree of differential expression (e.g., Asthma and Nonleuk vs MDS comparisons).

Before assessing the performance of GEOlimma in the four evaluation datasets, we first performed a simulation study to evaluate the effect of GEOlimma on false positive and false negative predictions (see “Methods” for details). Specifically, we used the madsim R package to simulate 50 synthetic datasets with the same distribution as the Nonleukemia vs AML comparison. We then evaluated the propensities of Limma and GEOlimma to make false positive and negative predictions in these datasets. Overall, we found very little difference between the methods, with average false positive rates (FPR) of 0.0106 and 0.00949 for Limma and GEOlimma, respectively. The average false negative rates (FNR) for these methods were 0.5002 and 0.5001, respectively. Additional file 4: Fig. S4 shows the gene-wise FPR and FNR distributions resulting from these experiments.

In order to explore the practical benefits of using GEOlimma, we then compared the accuracy of DE gene identification between GEOlimma and Limma for each of the four DE comparisons. For each comparison, we first performed DE analysis on all samples using Limma, with the resulting significant DE genes (n = 1241 [FDR \(\le\) 0.4], 2619 [FDR \(\le\) 0.05], 8610 [FDR \(\le\) 0.05], and 10975 [FDR \(\le\) 0.05] for the Asthma, Nonleuk vs MDS, Nonleuk vs AML, and AML vs MDS comparisons) being treated as the ground truth. We note that we relaxed the significance thresholds for the Asthma comparison in order to include a sufficient number of DE genes for subsequent evaluation. Next, we randomly generated non-overlapping sample subsets for each comparison based on the minimum sample size at which the group proportions of the dataset could be maintained. For example, as GSE8052 contains 66% Asthma and 34% Non-asthma samples, the smallest sample size considered was 6 (4 Asthma, 2 Non-asthma) in order to ensure \(\ge\) 2 samples per group. We then increased this sample size in increments of 3 to also consider subsets of 9, 12, and 15 samples. We then applied both GEOlimma and Limma on each of the sample subsets to determine which method best recovered the ground truth. Specifically, we used the R package ROCR [41] to compute areas under the receiver operating characteristic curve (AUCs) given the GEOlimma/Limma B scores and the ground truth. Figure 4 depicts the AUC improvement of GEOlimma over Limma for all four comparisons. Notably, GEOlimma consistently increases the average AUC for each of the subset sizes, with an overall average AUC improvement of 0.04. Furthermore, in the three comparisons made within GSE15061, GEOlimma increases AUC for every subset tested. Interestingly, the AUC improvement is largest for the smallest sample sizes evaluated and decreases slightly as sample size increases. This further supports the assertion that GEOlimma has a bigger impact on datasets with more modest expression differences (as would result from a small sample size). To confirm that these improvements result specifically from the DE prior probabilities learned using publicly available GPL570 data, we randomly shuffled the prior probabilities and repeated the above analysis. As seen in Additional file 5: Fig. S5, GEOlimma using randomized prior probabilities consistently decreases AUC compared to Limma.

To quantify the experimental power gained by using GEOlimma, we converted AUC values into effective sample size. Specifically, for each of the evaluation datasets, we first calculated AUCs resulting from applying Limma to all non-overlapping sample subsets ranging in size from the minimum number needed to maintain group proportions (described above) to the total number of replicates. For example, in the Asthma comparison we considered all subsets of size 6–402 in increments of 3. These AUCs enabled us to fit a “standard curve” for each comparison, from which we could interpolate the mean number of samples gained by using GEOlimma given initial numbers of 6, 9, 12, and 15 (Asthma) samples. Additional file 6: Fig. S6 presents the AUC standard curves and Table 3 summarizes the distribution of GEOlimma effective sample sizes for each comparison. Overall, GEOlimma leads to a substantial increase in mean effective sample sizes, particularly when applied to smaller subsets, where we observed gains of 157–288% for the smallest sample sizes evaluated for each comparison. The Asthma comparison shows the largest relative increases across all subsets, with the mean GEOlimma effective sample size more than doubling that of Limma even for the largest subset tested (m \(=\) 15). These results demonstrate the gains in experimental power for DE gene discovery that are possible with the use of GEOlimma.

Fig. 4
figure 4

Area under the ROC curve (AUC) improvement of GEOlimma over Limma for identifying DE genes from a range of data subset sizes: a Asthma vs Non-asthma comparison, b Nonleukemia vs AML comparison, c Nonleukemia vs MDS comparison, d AML vs MDS comparison

Table 3 Distribution of effective sample sizes from applying GEOlimma to a range of data subsets (size denoted by N) for four differential expression comparisons

Classification performance using GEOLimma feature selection method

Feature selection is a critical step in supervised classification for diagnosis, prognosis and treatment. Here we compare the abilities of GEOlimma and Limma as feature selection methods to perform accurate classification on the four evaluation datasets. To focus on the most challenging classification tasks for each comparison, we randomly sampled subsets of size 20 from each of the two groups. Specifically, we generated 10 pairs of subsets for training, with each pair containing 40 total samples (20 per group). In the same manner, we also generated an additional 10 pairs of samples for testing. During training, we performed tenfold cross-validation to estimate model performance. Given the large numbers of genes present in these datasets, we focused on the 1000 genes with the highest variance across all samples within each comparison. Within these 1000 genes, we selected the top 100–1000, in increments of 100, using either Limma or GEOlimma and performed classification using a logistic regression (LR) classifier. For each sampled subset, we applied a one-sided (hypothesis: GEOlimma AUC > Limma AUC) paired Wilcoxon test to compare the AUC differences between GEOlimma and Limma at each feature size (10 total). Because of the near perfect AUC observed for subsets of the AML vs MDS and Nonleuk vs AML comparisons, we only evaluated AUC differences for the Asthma and Nonleuk vs MDS comparisons using the Wilcoxon test. Table 4 shows the mean AUC differences of Asthma for each of the 10 pairs of subsets.Although many of the subsets do not show a significantly higher GEOlimma AUC, we note that the average GEOlimma - Limma AUC difference for both training and testing subsets is positive. Furthermore, subset pairs 7 and 9 show a significant GEOlimma AUC improvement in both training and testing subsets, while none of the negative AUC differences observed were significantly less than 0 (hypothesis: Limma AUC > GEOlimma AUC) in training sets. Figure 5 shows the GEOlimma and Limma AUC values at each number of features for subset pairs 7 (A) and 9 (B). For the Nonleuk vs MDS comparison, we find no significant differences between GEOlimma and Limma AUCs in training or testing subset pairs. Figure 5c shows one example of a training pair for this comparison. Overall, our results suggest that use of GEOlimma for feature selection can provide moderate improvements in classification performance for datasets with a modest overall degree of differential expression (e.g., Asthma comparison). For datasets with more pronounced degrees of differential expression, use of GEOlimma resulted in very similar classification performance compared to Limma.

Fig. 5
figure 5

Classification performance of data subsets using a logistic regression classifier with GEOlimma and Limma feature selection methods. The x-axis indicates the number of selected features; y-axis indicates classification AUC. The top three plots display training AUC values; the bottom three plots depict validation AUCs. a Asthma vs Non-asthma subset 7 AUCs, b Asthma vs Non-asthma subset 9 AUCs, c Nonleukemia vs MDS subset 9 AUCs

Table 4 Differences in classification performance (GEOlimma AUC–Limma AUC) for 10 data subsets of the Asthma comparison. Italic p values (Wilcoxon signed-rank test) denote statistically significant AUC improvements of GEOlimma over Limma

Discussion

In this study, we developed a differential expression feature selection method, GEOlimma, in which we calculated gene-level differential expression (DE) prior probabilities from large-scale GEO transcriptomics data and incorporated them into a Bayesian framework. In a DE analysis, GEOlimma detected a larger number of DE genes in four comparisons within two evaluation datasets, compared to Limma. By analyzing small sample subsets of each dataset, we showed that knowledge-driven GEOlimma substantially improved experimental power in terms of effective sample size. Furthermore, in a supervised classification analysis, GEOlimma used as a feature selection technique led to similar or better classification performance than standard Limma given noisy, small sample subsets from the Asthma comparison.

We also biologically characterized genes with especially high or low DE prior probabilities using KEGG pathway enrichment analysis. The strongest signal came from genes with high DE prior probabilities, where we detected enrichment in cell growth and death, signal transduction and cancer-related pathways. Cell growth and death are fundamental biological processes; however, deregulation of these processes is often involved in carcinosis. Specifically, resisting cell death and sustaining proliferative signaling were reported to be hallmarks of cancer [42]. This prevalence of enriched cancer-specific pathways may be indicative of an over-representation of cancer-related studies in data repositories such as GEO, which has been previously reported [28, 43]. However, while we saw excellent improvements in experimental power in differential expression analysis of three cancer-related comparisons, we note that the largest relative increases in effective sample size were observed in the Asthma comparison. This suggests that GEOlimma can also provide a substantial benefit to datasets that are unrelated to cancer.

We closely modeled GEOlimma after the widely-used differential expression analysis method Limma. Since its first publication nearly 15 years ago, papers describing the Limma method [38, 39, 44] have been cited over 10,000 times for applications in differential expression analysis of DNA microarray or RNA-Seq transcriptomics data. For the latter application, the more recently-developed voom method [44] adapts the Limma empirical Bayesian framework to read count data, which enables computation of posterior DE probabilities for RNA-Seq experiments. Although we only applied GEOlimma to DNA microarray data in this study, our approach is readily transferable to RNA-Seq data through the use of the voom methodology.

In this study, we made use of all available GPL570 GEO datasets (GDS), which we acknowledge represent a relatively small subset of all available GPL570 data at GEO. We made this selection in large part due to the high-quality curation of GDS datasets compared to the more abundant GSEs, which allowed us to easily perform multiple differential expression comparisons within each dataset. Given recent advances in natural language processing and the extraction of experimental metadata (e.g., [45] ), an exciting future direction is the automatic annotation and inclusion of the larger number of GSEs (5154 for GPL570 as of June 2019) in the DE prior probability calculations. Such an expansion of a pre-existing data collection would enable subdivision and calculation of condition-specific DE prior probabilities (e.g., stem cell-related or viral infection-related), which could further improve GEOlimma performance when applied to the analysis of related datasets. One final future direction is the generalization of GEOlimma DE prior probabilities from individual values to probability distributions. In this case, DE hyperprior parameters could be calculated from pre-existing data rather than explicit prior probabilities. This modification would enable a more nuanced adjustment of DE posterior probabilities by GEOlimma given the biological characteristics of the dataset of interest.

Conclusions

Overall, our results demonstrate that GEOlimma effectively utilized pre-existing transcriptomics data for improved differential expression and feature selection analyses. Due to its focus on gene-level differential expression, GEOlimma also has the potential to be applied to other high-throughput biological datasets.

Methods

GEOlimma method formulation

We developed the GEOlimma method by combining the widely-used differential expression (DE) analysis method Limma, which is typically used to analyze gene expression microarray and RNA-seq data and assess differential expression between biological conditions. Limma uses empirical Bayesian methods to provide stable DE predictions, which is particularly useful when the number of sample replicates is small. However, one simplifying assumption made by Limma is that the DE prior probabilities for each gene are identical (set 0.01 by default). GEOlimma combines the Bayesian nature of Limma with gene-level DE prior probabilities calculated from large-scale microarray datasets to better select genes that are biologically relevant to a comparison of interest.

The Gene Expression Omnibus (GEO) is a public data repository for high-throughput gene expression data including microarray and RNA-seq data [25]. GEO DataSets (GDS) are a subset of the repository that store curated gene expression datasets, along with the original data (GEO Series) and experimental platform information. GPL570, also known as the HG-U133_Plus_2 Affymetrix Human Genome U133 Plus 2.0 Array, is one of the best-represented human genome microarray platforms in GEO, with 149,049 samples available (as of June 7, 2019). GPL570 measures over 47,000 human transcripts, which consist of the Human Genome U133 Set plus 6500 additional genes. In this study, we downloaded all 602 GPL570 GEO DataSets (GDS) (current as of June 7, 2019). Specifically, for each dataset we obtained normalized, log-transformed expression values at the probeset level. We then mapped these probesets to the non-redundant Entrez Gene IDs (provided by the Bioconductor R package hgu133plus2.db) and obtained gene-level expression values by computing medians across any probe sets mapping to the same gene. With the minimum requirement of 5 samples in each group, we performed pairwise DE analysis among the largest possible collection of non-overlapping sample groups from each GDS experiment. As an example, GDS4281 measures expression in 29 tumor samples from the intraocular cancer uveal melanoma. The samples are divided into male (17) and female (12), as well as right eye (16) and left eye (13). By merging these two groupings, we obtained a total of four non-overlapping groups—male-right eye (10), male-left eye (7), female-right eye (6), and female-left eye (6)—between which we could make the maximum number of \({4 \atopwithdelims ()2} = 6\) pairwise DE comparisons. For each DE comparison, we applied the Limma moderated t-test [39] (using the “lmFit” and “eBayes” functions) to calculate differential expression p-values for each gene. Given a list of p-values for a particular comparison, we adjusted for multiple hypothesis testing using the Benjamini-Hochberg (BH) procedure [46]. Genes with adjusted p-values (false discovery rates or FDRs) \(\le\) 0.05 for a given pairwise comparison were considered DE for that comparison. We calculated the DE frequencies across all comparisons for each gene and converted these frequencies to DE prior probabilities (\(P(DE_i)\) for the ith gene) as follows:

$$\begin{aligned} P(DE_i) = \frac{\sum _{j=1}^M I(FDR_{ij} \le 0.05)}{M} \end{aligned}$$
(1)

where \(i \in \{1, \ldots , N\}\) indexes each gene, \(j \in \{1, \ldots , M\}\) indexes each comparison, \(FDR_{ij}\) represents the FDR for the ith gene in the jth comparison, and \(I(\cdot )\) is the indicator function.

We chose human asthma and cancer validation datasets present as GEO Series (GSE) but not as GEO DataSets (GDS), in order to avoid double counting data. The asthma dataset [47] consists of 404 total samples (transformed lymphoblastoid cell lines) taken from 268 children afflicted with asthma and 136 healthy children. The cancer dataset [48] consists of 870 total bone marrow samples, of which 202, 164, and 69 are from individuals with acute myeloid leukemia (AML), myelodysplastic syndrome (MDS), and neither AML nor MDS, respectively. We considered the three possible comparisons between these three groups. In total, we evaluated four comparisons: Asthma vs Non-asthma, Nonleukemia vs AML, Nonleukemia vs MDS, and AML vs MDS.

For a given comparison, we compute GEOlimma DE posterior probabilities using Bayes’ theorem:

$$\begin{aligned} P(DE_{i} \mid Data) = \frac{P(Data \mid DE_{i}) \, P(DE_{i})}{P(Data)} \end{aligned}$$
(2)

where Data represents the samples making up the given comparison, \(P(Data \mid DE_{i})\) denotes the likelihood of the Data, as calculated by Limma [37], \(P(DE_{i})\) is the previously calculated DE prior probability, and P(Data) is a normalization constant [37]. Given these posterior probabilities, we then calculate B scores (log odds of DE) for each gene as follows:

$$\begin{aligned} B_{i} = \log \left [ \frac{P(DE_{i} \mid Data)}{1 - P(DE_{i} \mid Data)} \right ] \end{aligned}$$
(3)

We implemented GEOlimma as modified R functions based on code from the Limma package.

Enrichment analysis for gene sets

To explore the DE prior probabilities biologically, we conducted KEGG Enrichment Analysis using the R package ClusterProfiler [49]. Specifically, we identified enriched KEGG pathways using the hypergeometric test in both the top and bottom 500 most/least frequently DE genes, separately. Pathways with BH-adjusted p-values less than 0.05 were considered significantly enriched. We used the Pathview R package [50] to visualize the location of DE genes in particular KEGG pathways.

Differential expression analysis

Evaluation datasets

As described above, we downloaded the GSEs for two evaluation datasets from GEO. As with the GDS data, we mapped normalized, log-transformed expression values at the probeset level to non-redundant Entrez Gene IDs and consolidated expression values by computing medians across probe sets mapping to the same gene. We included all genes with unique probe mappings (20,283 total) for subsequent analyses. For each of the four evaluation comparisons, we performed DE analysis on all samples using both GEOlimma and Limma. Genes were considered DE if their BH-adjusted p-value \(\le\) 0.05 (Limma) or their B score exceeded the smallest Limma B score for genes with adjusted p-value \(\le\) 0.05 (GEOlimma).

Sample visualization

To visualize samples, we first used Principal Component Analysis (PCA) to reduce the dimensionality of genes as features. We visualized the first two components of PCA. We further applied the t-Distributed Stochastic Neighbor Embedding (t-SNE) method to visualize the first 10 PCA components in 2 dimensions. t-SNE can reduce the dimensionality of data based on conditional probabilities that preserve local similarity. We used a t-SNE implementation that makes Barnes-Hut approximations, allowing it to be applied on large real-world datasets [51]. We set the perplexity to 15, and sample points were colored using the group information.

Simulation study

We performed a simulation study to evaluate the effect of GEOlimma on false positive and false negative predictions. Specifically, we used the madsim R package to simulate data with the same distribution as the Nonleukemia vs AML comparison, from which we identified 8610 DE genes using Limma as described above and below. We simulated 30 samples for each of the two conditions and generated a total of 50 synthetic datasets. We then performed differential expression analysis in each dataset using both Limma and GEOlimma and considered the highest-scoring 8610 genes in each method as DE. Next, we computed the probabilities that each of the original 8610 DE genes was incorrectly predicted to be non-DE across the simulated datasets as the false negative rates (FNRs). We also computed the probabilities that each of the original 11,673 non-DE genes was incorrectly predicted to be DE across the simulated datasets as the false positive rates (FPRs).

Experimental power

To quantify the performance improvement achieved by GEOlimma vs Limma, we performed DE analysis on small sample size subsets for each comparison. As detailed below, we started with the minimum subset size at which the group proportions for a given comparison could be maintained and generated all non-overlapping sample subsets of this size. We then increased this subset size by the smallest possible sample increment and repeated the generation of subsets. For each sample subset, we first applied both GEOlimma and Limma and ranked genes by their corresponding B scores. Next, using the Limma DE genes previously identified from all samples as the ground truth (see “Results” section for specific numbers), we applied the R package ROCR [41] to calculate Area under the ROC curves (AUCs) for the B score-ranked genes of each subset. We calculated the performance improvement of GEOlimma over Limma for each subset as the difference in AUC between the two methods. In addition, we converted these AUC improvements into gains in effective sample size by constructing and interpolating from a “standard curve” of mean Limma AUC values calculated across the full range of possible sample sizes. As an example, if GEOlimma delivered an AUC improvement of 0.1 over Limma for a subset of size 10, the GEOlimma effective sample size is simply the sample size of the standard curve corresponding to an AUC value 0.1 higher than the mean Limma AUC value for 10-sample subsets.

Supervised classification

We performed supervised classification for each comparison in the evaluation datasets using both GEOlimma and Limma as feature selection methods. Scikit-learn (sklearn) [52] is a Python module implementing machine learning algorithms. It enables various tasks such as dimensionality reduction, classification, regression and model selection. The sklearn classification pipeline involves sequentially applying feature selection, classification, parameter optimization and model selection to yield final classification results. We first used the Python rpy2 module to build a connection between sklearn and the R language, followed by creating customized feature selection methods for Limma and GEOlimma which we compiled into the sklearn pipeline function. For classification training, we first sampled 10 subsets of 40 samples (20 from each of the two groups) at random and selected the 1000 genes with largest variance across these samples. Next, we fed data from each subset to the sklearn pipeline function and performed either Limma or GEOlimma-based feature selection by selecting subsets of 100–1000 genes (in increments of 100) with the highest B scores. We selected the Logistic Regression [53] classifier for classification. We also included L1 and L2 penalties as hyperparameters and applied tenfold cross validation to train the model and optimize the hyperparameters. We used classification AUC as the criterion to evaluate classification performance. A high AUC represents both high recall and high precision, which translate to low false positive and false negative rates. For classification testing, we sampled an additional 40 samples to evaluate the training models. We used a Wilcoxon signed-rank test to identify significant AUC differences between performing feature selection using Limma or GEOlimma.