Objectives

Cellular deconvolution is a computational process that infers the cellular composition of heterogeneous tissue samples from bulk RNA-sequencing data. It is increasingly used to help researchers track cell population dynamics and to better explore nuclear transcriptional state by controlling for confounding interspecimen differences in cellularity. Well over 50 different algorithms or mathematical approaches have been developed for deconvolution [1], and new ones are being proposed regularly.

These approaches have different advantages and disadvantages, and their performance varies with considerations specific to the experimental context, such as the cellular complexity of the tissue under investigation and the desired informational output. Beyond the choice of algorithm, numerous other decisions made when building a deconvolution pipeline for a specific experimental application can influence accuracy, including the choice of reference gene expression profiles or marker genes and how the bulk gene expression data are normalized or pre-processed [2,3,4]. To ensure optimal performance, these parameters are ideally selected empirically via benchmark testing, which is typically performed using gene expression data from in vivo, in vitro, or in silico samples composed of a known mixture of cell types [5]. While it can be argued that in vivo benchmarking datasets represent the gold standard, few exist, particularly for whole blood, which is the single most profiled tissue in human transcriptomic investigations [6].

As part of a larger study that aimed to assess the drivers of peripheral blood gene expression patterns [7], our group recently used a combination of next-generation RNA sequencing and flow cytometry to generate a unique dataset containing whole blood gene expression profiles and matched leukocyte counts from a large cohort of human donors. Given the current lack of in vivo benchmarking datasets for whole blood, these data have value for secondary use in evaluating cellular deconvolution pipelines.

Data description

To generate the dataset, parallel venous whole blood specimens were collected from 138 adult donors via K2EDTA and PAXgene vacutainers at admission to the Emergency Department at Dell-Seton Medical Center (Austin, TX) as described by our group previously [7]. K2EDTA vacutainers were used immediately for flow cytometry analysis, while PAXgene vacutainers were stored until downstream RNA isolation.

White blood cell differential was assessed in EDTA-treated whole blood via four-angle optical flow cytometry. Relative neutrophil, lymphocyte, monocyte, and eosinophil counts were generated by dividing the absolute count of each leukocyte subpopulation by the absolute total leukocyte count. The final cell counts display a high degree of inter-sample heterogeneity in overall leukocyte composition, and for each cell type the relative counts span well beyond the adult human reference range (Supplemental Fig. 1) [8]. Collectively, this suggests that the final dataset captures adequate variance in cell counts to be generalizable for use in deconvolution benchmarking.
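The relative counts described above are simple ratios, and a minimal sketch of the arithmetic follows. The function name, the donor values, and the units are illustrative assumptions and are not taken from the dataset.

```python
def relative_counts(absolute_counts, total_leukocytes):
    """Divide each absolute subpopulation count by the absolute total leukocyte count.

    All counts must share the same units (for example 10^3 cells per microliter).
    """
    if total_leukocytes <= 0:
        raise ValueError("total leukocyte count must be positive")
    return {cell: count / total_leukocytes for cell, count in absolute_counts.items()}


# Hypothetical donor (illustrative values only, not from the dataset).
example = relative_counts(
    {"neutrophil": 4.2, "lymphocyte": 1.8, "monocyte": 0.5, "eosinophil": 0.2},
    total_leukocytes=7.0,
)
print(example)
```

Because only four subpopulations are reported, the resulting proportions need not sum to exactly 1 for a given donor.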

Total RNA was isolated from PAXgene-stabilized whole blood using spin column-based solid phase extraction. RNA purity and integrity were assessed using a combination of spectrophotometry and chip capillary electrophoresis. Ribosomal RNA and globin mRNA-depleted cDNA libraries were prepared and subsequently sequenced via Illumina sequencing using paired-end 150 bp reads. Reads were aligned to human reference genome GRCh38, and the counts of mapped reads were summarized at the gene level. On average, approximately 40 million reads were generated per sample, and greater than 90% of reads mapped to the reference genome (Supplemental Fig. 2) [9]. Transcripts from a total of 45,429 genes were detected, including a median of 15,425 protein-coding genes, 4,822 lncRNA-associated genes, 2,947 pseudogenes, and 196 miRNA-associated genes per sample (Supplemental Fig. 3) [10]. This suggests that the final dataset contains adequate genomic coverage to be compatible with a wide range of reference gene expression profiles and marker gene lists that could be employed in cellular deconvolution benchmarking tests.

Importantly, deconvolution of the final gene expression data via marker genes using a simple principal components analysis-based approach [11] yields inferred cell counts that are highly correlated with the actual cell counts measured with flow cytometry (Spearman’s rho = 0.73–0.84; Supplemental Fig. 4) [12]. This indicates that the final gene expression data and flow cytometry data are correctly integrated in terms of donor-level matching, and that the dataset has true utility for this particular secondary use.
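The following sketch illustrates the general shape of such a check. It scores each sample on the first principal component of a marker gene matrix, then computes Spearman's rho against measured proportions. It is a generic illustration and not the method of [11]: the function names, the power-iteration solver, the sign convention, and the toy data are all assumptions made for this example.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def first_pc_scores(matrix, iterations=200):
    """Project samples (rows) onto the first principal component of marker genes (columns)."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[g] for row in matrix) / n for g in range(m)]
    centered = [[row[g] - means[g] for g in range(m)] for row in matrix]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):  # power iteration for the leading eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    if sum(v) < 0:  # assumed convention: marker genes load positively
        v = [-c for c in v]
    return [sum(r[g] * v[g] for g in range(m)) for r in centered]


# Toy data: three hypothetical marker genes that scale with the measured proportion.
measured = [0.52, 0.31, 0.74, 0.45, 0.66]
expression = [[10 * p, 5 * p + 1.0, 8 * p + 2.0] for p in measured]
rho = spearman_rho(first_pc_scores(expression), measured)
print(round(rho, 3))
```

In practice the rows would be donors, the columns would be normalized expression values for marker genes of one cell type, and a low rho would flag either a weak marker set or a donor-level mismatch between the expression and flow cytometry records.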

All final data are available from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) via permanent accession number SRP429744 [13]. Raw sequencing data are available as .fastq files and can be downloaded individually or in bulk using the SRA run selector. Quality metrics associated with the source RNA, as well as basic demographic information and white blood cell counts for all donors, are available via the attribute slots of linked BioSample records. RNA quality metrics include RNA integrity numbers, 260:230 ratios, and 260:280 ratios; donor demographic information includes age, sex, race, and ethnicity; and white blood cell counts include relative neutrophil, lymphocyte, monocyte, and eosinophil counts, all under accordingly named attribute slots. All cell counts are listed as decimal-formatted proportions. These linked BioSample attributes can be bulk downloaded as metadata using the SRA run selector when downloading sequencing data.
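Once the metadata table has been downloaded, the cell proportions can be read into a per-run lookup for benchmarking. The sketch below assumes a comma-separated file with one row per sequencing run. The column headers for the cell counts are hypothetical placeholders, as are the run identifiers; they should be replaced with the actual attribute names and accessions found in the downloaded file.

```python
import csv
import io

# Hypothetical column headers; substitute the attribute names in the real metadata file.
CELL_COLUMNS = {
    "neutrophil": "neutrophil_proportion",
    "lymphocyte": "lymphocyte_proportion",
    "monocyte": "monocyte_proportion",
    "eosinophil": "eosinophil_proportion",
}


def load_cell_proportions(handle, id_column="Run", cell_columns=CELL_COLUMNS):
    """Return {run_id: {cell_type: proportion}} from an open CSV file handle."""
    table = {}
    for row in csv.DictReader(handle):
        table[row[id_column]] = {
            cell: float(row[column]) for cell, column in cell_columns.items()
        }
    return table


# In-memory stand-in for a downloaded metadata file (placeholder IDs and values).
example_csv = io.StringIO(
    "Run,neutrophil_proportion,lymphocyte_proportion,"
    "monocyte_proportion,eosinophil_proportion\n"
    "RUN_A,0.62,0.27,0.08,0.03\n"
    "RUN_B,0.55,0.33,0.09,0.03\n"
)
proportions = load_cell_proportions(example_csv)
print(proportions["RUN_A"])
```

For a real file, the same function can be called on a handle opened with `open(path, newline="")`.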

Table 1 Overview of files and datasets

Limitations

As with any dataset, there are caveats and limitations that should be considered when planning for future use. Most notably, the dataset only contains cell count data for the four most abundant circulating white blood cell populations, as opposed to the more granular cell populations that can be quantified via fluorescent flow cytometry. With respect to benchmarking cellular deconvolution pipelines, this may make the dataset best suited to evaluating the performance of marker-based and reference-based deconvolution approaches such as CIBERSORT [14], xCell [15], and DeMix [16], as the cell types for which counts are to be inferred are dictated a priori by the user. However, even in the case of reference-free deconvolution approaches, the dataset could still be employed to assess how well the collective inferred counts of any more granular cell types that are output correlate with actual cell counts of the parent cell population.
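Collapsing granular output to the parent populations measured here amounts to summing inferred fractions per parent. A minimal sketch follows; the cell type labels, the parent mapping, and the example fractions are illustrative assumptions and would need to match the labels produced by the deconvolution tool in use.

```python
from collections import defaultdict

# Hypothetical mapping from granular labels to the four measured populations.
PARENT_OF = {
    "CD4 T cell": "lymphocyte",
    "CD8 T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "NK cell": "lymphocyte",
    "classical monocyte": "monocyte",
    "non-classical monocyte": "monocyte",
    "neutrophil": "neutrophil",
    "eosinophil": "eosinophil",
}


def collapse_to_parents(granular, parent_of=PARENT_OF):
    """Sum inferred fractions of granular cell types into their parent populations.

    Cell types without a parent in the mapping are ignored.
    """
    totals = defaultdict(float)
    for cell, fraction in granular.items():
        parent = parent_of.get(cell)
        if parent is not None:
            totals[parent] += fraction
    return dict(totals)


# Illustrative deconvolution output for one sample (not real data).
collapsed = collapse_to_parents({
    "CD4 T cell": 0.12, "CD8 T cell": 0.08, "B cell": 0.05, "NK cell": 0.05,
    "classical monocyte": 0.06, "non-classical monocyte": 0.02,
    "neutrophil": 0.60, "eosinophil": 0.02,
})
print(collapsed)
```

The collapsed values can then be compared against the measured relative counts across donors, for example with a rank correlation.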