Background

Epigenome-wide association studies (EWAS) of human disease are becoming increasingly common. DNA methylation (DNAm) is of particular interest because it is dynamic across the lifetime, affected by environmental insults, and previously implicated in developmental disorders and cancer [1]. In these studies, DNAm levels are measured genome-wide at thousands to millions of sites in hundreds of individuals to identify loci where these levels are associated with quantitative traits or disease [1, 2]. Because existing cohort studies that extensively characterize participants often store blood samples, the most widely available tissue for subsequent/retrospective EWAS is whole blood. Furthermore, many studies measure genome-wide DNAm in blood as obtaining disease-relevant tissues is often invasive and/or impossible. With many of these studies completed, few disease-associated loci have been reported outside of cancer [3], type 1 diabetes [4], and rheumatoid arthritis [5]. Instead a series of papers reporting age-related changes of DNAm profiles have been published [614].

Age-related changes in DNAm have been previously reported and functionally described by Chu et al. [15]. In this carefully designed study, fluorescence-activated cell sorting (FACS) was used to separate peripheral blood into pure cellular populations. DNAm was measured in four genomic regions, selected using biological insight, and modest age-related changes were found in CD4+ and CD8+ T cells. In contrast, the above-mentioned EWAS measured DNAm for all CpGs selected by the array manufacturers and used whole blood as a source tissue. Whole blood is a heterogeneous collection of different cell types, each with a very different DNA methylation profile [16, 17]. Observed whole blood DNAm profiles are therefore mixtures of the cell type profiles. In a seminal paper, Houseman et al. [16] describe a statistical method that can accurately estimate relative proportions of cell type components in whole blood. Using practically the same statistical approach, Guintivano et al. [18] describe a method for estimating neuron and non-neuron components in brain samples. However, currently there are no published statistical solutions to parsing age effects by cell type from observed whole blood DNAm measurements.

We examined data from five publicly available studies (Additional file 1) and found strong evidence of cell composition changes across age. Furthermore, we find high levels of confounding between age-related variability and cell composition. We report findings that underscore the importance of accounting for cell composition variability in epigenetic studies based on whole blood and other heterogeneous tissue sources.

Results and discussion

DNAm profiles show large between cell type differences

We downloaded Illumina HumanMethylation450 BeadChip (Illumina 450k) data from flow-sorted neutrophils (granulocytes), lymphocytes (CD8+ and CD4+ T cells, CD56+ natural killer cells and CD19+ B cells) and CD14+ monocytes from six adult male samples (mean age 38 ± 13.6 years) as previously described [17] and confirmed that sorted blood cell types have unique DNAm profiles (Figure S1 in Additional file 2). In fact 63.5% of the CpGs on the Illumina 450k array showed differences with P < 0.05 across these cell types (Figure S1C in Additional file 2).

We used these data to adapt the statistical method developed by Houseman et al.[16] for the Illumina HumanMethylation27 BeadChip (Illumina 27k) array to estimate cell composition from DNAm profiles obtained with its successor, the Illumina 450k. We select a subset of 600 cell-type-specific CpGs (Figure 1) and then use these to estimate proportions in whole blood samples (see Materials and methods). We provide a table with statistical summaries of cell-type variability for all CpGs on the Illumina 450k array (Additional file 3).

Figure 1
figure 1

Illustration of how blood composition drives observed age differences. (A) Heatmap of the cell sorted data shows very clear and consistent DNAm profiles for each cell type. We show 600 probes selected for estimating composition proportions used to demonstrate differences here. (B) To simplify the illustration we selected a section of (A) displaying only the two most abundant cell types: CD4+ T cells and granulocytes. (C) Heatmap of a randomly selected sample of 30 whole blood samples (from the data in Additional file 1) across three age groups (10 per group): between 1 and 5 years of age, between 30 and 40, greater than 60 years. The same probes as in (B) are used. When the samples are ordered by their estimated granulocyte proportion, the samples roughly cluster by age and a similar pattern to (B) is observed. The estimated cell count proportions for each of the samples are shown below. Note the strong confounding between age and cell composition. (D) For the two samples highlighted with an arrow in (C), we show how a weighted average of the cell type profiles can reconstruct the observed DNAm profiles. The numbers shown are the estimated proportions. Note how different weights (cell counts) for old and young result in very different observed DNAm patterns. Note that the differences in CD4+ T cells and granulocytes drive much of the differences in DNAm. NK, CD56+ natural killer cells; CD8T, CD8+ T cells; CD4T, CD4+ T cells, Gran, granulocytes; Bcell, CD19+ B cells; Mono, CD14+ monocytes; DNAm, proportion of DNA methylation at individual CpGs (Illumina 'beta' values, bound between 0 and 1); Prop, cell count proportion, between 0 and 1 for each component, such that they sum to 1.

In sorted samples, cell type explains a larger percentage of variability than age

Given these results, for the purposes of our analysis, we assumed that, for the selected 600 CpGs, the cell type-specific DNAm profiles are the same for all ages. Although we know this assumption does not hold true for all CpGs [15], the results of this section suggest that it is reasonable for most CpGs, and our 600 CpG profile in particular. To demonstrate this, we interrogated two publicly available datasets - the Reinius et al. [17] Illumina 450k data on 6 men (sample ages were obtained from the authors) and Illumina 27k data from sorted CD4+ T cells and monocytes [6] on 24 and 26 subjects, respectively (see Materials and methods). First, we removed CpG probes that showed age associations (at P < 0.05) in the Reinius et al. [17] dataset when picking cell-type-discriminating probes for the cell composition estimation. Additionally, in Rakyan et al. [6] (which was a larger sample) we found that the percentage of variance explained by cell type was much greater than that explained by age within each cell type with most CpGs showing no significant association with age (Figure S2 in Additional file 2). Furthermore, among the 23 CpGs appearing on the Illumina 27k array that were among the 600 cell-type discriminating CpGs (from the Illumina 450k), only one probe (cg03439703) had a p-value < 0.05 when testing for association with age in both CD4+ T cells (P = 0.003) and monocytes (P = 0.047).

Varying cell composition may explain apparent age-associated differences

We downloaded all publicly available DNAm studies in peripheral blood measured with the Illumina 450k array (Additional file 1), re-normalized the data, and applied our method to obtain cell composition estimates for each sample. Note that only three of the studies were focused on finding age-related changes in DNAm [8, 10, 11], but all studies recorded age information. Figure 1 demonstrates that peripheral blood samples indeed appear to be a mixture of pure cellular components, and differences in DNAm may potentially arise merely from differences in the relative proportions of these components rather than site-specific changes in specific cellular populations (Figure 1C).

Cell type proportions change with age following monotonic patterns

We observed consistent age-related changes for the proportions of each cell type (Figure 2). These results are in line with previously published findings related to T cells, namely the involution of the thymus, where T cells in lymphocytes mature. This process begins very early in life [19] and continues with age - the size of the thymus drops approximately 3% per year until the mid-60s, and is approximately 5% the size of the thymus in a newborn [20], suggesting that the number, and diversity, of T cells decreases with age. However, we also note these age-cell count relationships, although monotonic, were non-linear with an inflection point around 40 years (Figure 2). While these findings may be partially attributable to 'batch' effects (given the strong correlation between age and study/dataset), datasets with overlapping age ranges (Liu et al. and Hannum et al.) have consistent age composition trends (Figure S3 in Additional file 2).

Figure 2
figure 2

Cellular composition changes across the lifespan. Estimated cellular composition proportions are plotted against age for (A) CD4+ T cells, (B) CD8+ T cells, (C) natural killer (NK) cells, (D) monocytes (Mono), (E) B cells, and (F) granulocytes (Gran). Color indicates the data source, which are described in Additional file 1. The black lines are curves fit to data with local weighted regression (loess) with confidence intervals in grey. Spearman correlation coefficients are reported for each composition proportion estimate and age.

Cellular composition correlates strongly with global DNAm profiles

Given that blood cell types have very different DNAm profiles (Figure S1 in Additional file 2) and that cell type proportions change across age (Figure 2), we assessed if cell composition was a major source of variability in the five peripheral blood data sets. We computed the first two principal components of the epigenome-wide DNAm profiles across the five studies and compared them to the first principal component of the cell proportion estimates (Figure 3). The correlation between DNAm variance and composition variance was apparent within each study, often to a stronger degree (Figure S4 in Additional file 2). These observed correlations therefore empirically demonstrate that cell composition is a very large source of variability in DNAm data derived from peripheral blood.

Figure 3
figure 3

Cellular composition is a major source of variability in DNAm datasets in whole blood. Principal components (PCs) (A) 1 and (B) 2 of the 456,655 DNAm probes (y-axis) and the first PC of the empirical cell counts (x-axis) are highly correlated. The first PC of the DNAm data explains 10.9% of the variance, and the second explains 9.3% of the variance. Color indicates data source, which are described in Additional file 1.

Confounding between cell composition and age leads to false positives

To determine the adverse effects at the single locus level of the observed confounding between age, cell composition, and DNAm, we reexamined the CpGs reported in the literature to be associated with age [613] across several different measurement platforms (Additional file 4). For each of the CpGs reported to associate with age on the Illumina 450k array (n = 134,489), we tested between-to-within cell type variability on the sorted DNAm data and found that 86.7% of these had P < 0.05 across cell type (Figure S5 in Additional file 2).

A simple linear regression model including the cell composition percentages as covariates has been suggested as a way to adjust for the confounding [5]. We applied this method to the data from Hannum et al. [10] and Alisch et al. [8] and found that the adjusted estimates are, on average, closer to 0 (Figure S6A in Additional file 2). However, at this level of confounding it is not clear that this naïve approach will in fact produce unbiased adjusted estimates (Figure 4A) [21]. We therefore tried two alternative approaches. First we applied the Remove Unwanted Variation (RUV) method [22], an analysis that estimates and adjusts for unknown surrogate variables as done by Leek and Storey [23]. This resulted in much greater, but not complete, attenuation of the age association estimates (Figure 4B; Figure S6B in Additional file 2). Next we obtained age association estimates from fitting the model to data from sorted CD4+ T cells and granulocytes. Note that in these data, cell composition is not a confounder and we see minimal evidence of age association (Figure 4C,D; Figure S6C,D in Additional file 2). We did not implement the adjustment approach suggested by Guintivano et al. [18] because mathematical derivations demonstrated their solution adjusts for confounding in special situations (see Materials and methods).

Figure 4
figure 4

Confounding between cellular composition and age at the CpG level. Comparisons between resulting t-statistics for age on DNA methylation levels in Hannum et al. [10] using (A) naïve (for example, including cell composition estimates as covariates in regression models), (B) two-step Remove Unwanted Variation (RUV), (C) flow-sorted CD4+ T cells and (D) flow-sorted monocytes compared to the effect of age on DNAm in a univariate model. The univariate and naïve models also adjusted for processing plate, which was a very strong confounder. Here, analysis with RUV attenuates the association between DNAm and age. The solid lines indicate the resulting t-statistic cutoff for false discovery rate <5% - no probes were significant at this threshold in the cell sorted data. All panels contain probes present on both the Illumina 450k and 27k (n = 24,692) to facilitate comparisons to age associations in the flow-sorted cellular populations.

Improved biological interpretation after composition filtering

We removed results from Johansson et al. [14] (which reporting one-third of the array was differentially methylated) then mapped the remaining 5,237 age-associated CpGs (Additional file 4) to human genes using the database provided by Triche [24]. For each Gene Ontology category [25] with more than 25 annotated gene IDs we counted the number of CpGs associated with a gene in that category and formed an observed count to expected count ratio (see Materials and methods). We then filtered this list by removing CpGs associated with cell composition and recomputed the observed to expected ratios. With the unfiltered list, 10 of the top 20 enriched categories were clearly related to the immune system while only three were related to development, whereas in the filtered list 9 of the top 20 were associated with developmental processes and only 4 to immune response (Additional file 5).

Conclusions

Whole blood has been one of the most widely used source tissues in EWAS. Here we demonstrate that, in these studies, cellular composition explains much of the observed variability in DNAm. Therefore, when the outcome of interest correlates with cell composition, as age does, failure to account for cellular heterogeneity may result in many false positives. For binary outcomes, for example, we may observe differences between cases and controls, not due to the real differences in DNAm, but rather due to cases and controls having different blood cell counts (Figure 1).

While our re-analysis of publicly available data does not necessarily suggest that all reported age-related DNAm changes in blood are false positives, it certainly suggests that one should account for cellular composition. We therefore recommend that users of the Illumina 450k array studying whole blood perform the cell composition estimation (using, for example, the estimateCellCounts function we have added to the minfi Bioconductor package) and check for possible confounding. If confounding is present, we recommend the use of our table (Additional file 3; also available in the FlowSorted.Blood.450k Bioconductor package) that summarizes cell-type variability for each CpG. Those CpGs with methylation values highly associated with cell-type variability should be treated with skepticism, and we strongly recommend that CpGs associated with both composition and the covariate of interest be validated using FACS-derived cellular populations.

Note that due to the high levels of confounding we currently do not recommend regression approaches for adjustment purposes, but we note that RUV performed best for reducing the composition-based confounding. However, when there is no or minimal confounding, the added unaccounted variability may result in false negatives. In such cases popular factor-based 'batch' correction methodology, like surrogate variable analysis [23], and RUV [22] can empirically estimate and control for cell-type composition.

Note that these confounding problems are not confined to blood, but rather any tissue source that contains a mixture of cell types. Here, careful study design, via targeted validation employing cell sorting within the tissue of interest, can help isolate cell type-specific changes, such as age-related DNAm changes in the pure cellular populations of blood beyond the preliminary negative findings in CD4+ T cells and monocytes from Rakyan et al. [6]. These may better explain observed biological effects, specifically, which epigenetic marks mediate risk for disease or associate with a trait. Characterizing and exploring the effects of cellular heterogeneity is therefore a necessary step in the analysis of genome-wide DNAm data in any heterogeneous tissue source, especially peripheral blood.

Materials and methods

Sample and study selection

There were five publicly available datasets on the Illumina 450k platform [5, 8, 10, 11, 26] performed on blood samples in the Gene Expression Omnibus (GEO) available through the National Center for Biotechnology Information (NCBI) as of February 2013 [27]. We also downloaded cell sorted data described in the Results section from Reinius et al. [17] (GSE35069). Because study and age were almost perfectly confounded, and because there were very strong effects of study in the processed GEO data, we required 'raw' methylated (M) and unmethylated (U) channels from the Illumina 450k to preprocess and normalize all of the samples together, including the cell-sorted dataset. One study, Horvath et al.[12], was not included in the manuscript because the GEO entry lacked raw data. Samples were dropped according to three criteria: 1) missing an age in the database (N = 11); 2) known to be cell-sorted, according to published manuscripts (N = 2, from Heyn et al.[11]); and 3) hypothesized to be cell-sorted, based on granulocyte count values (Figure S7 in Additional file 2), including all centenarian samples from Heyn et al (N = 19), as all appeared to be only granulocytes, and 21 samples from Harris et al.[26], which appeared to be granulocyte-depleted (the manuscript refers to a subset of samples being sorted, but it was not available information in the GEO entry). This left 1,098 samples across 5 studies.

We performed across-array quantile normalization within the M and U channels separately to normalize intensities across samples. Before normalization, we dropped probes on the sex chromosomes (chromosome X = 11,232 and chromosome Y = 416) and also probes that contained an annotated SNP (via dbSNP 137 Common database) in the CpG site (N = 16,756) and at the single base extension site (N = 7,880). This left 456,655 autosomal probes across the epigenome. After normalization, DNAm measurements on the logit scale were calculated as log2(M/U), and then transformed to Illumina’s 'beta' scale (proportion methylation, between 0 and 1). This approach is similar to the 'ABNorm' approach described by Sun et al. [28], but we use the logit transform described above rather than the Illumina approach [M/(M + U + 100)] for calculating the beta values.

Empirically estimating cellular composition using the Illumina 450k microarray

We tailored the algorithm designed by Houseman et al. [16] for the Illumina 27k array to the Illumina 450k array. Briefly, the Houseman algorithm identified 500 CpGs that discriminated cellular composition in flow-sorted cell populations (consisting of CD4+ and CD8+ T cells, B cells, monocytes, natural killer cells, and granulocytes). The algorithm then fits a nonlinear random effects model at each of these CpGs, estimating the coefficient for each cellular component, and then uses these coefficients to predict the relative proportion of each cellular component in peripheral blood samples.

However, there were several reasons that prevented the direct use of Houseman et al.’s algorithm on the 1,098 blood samples obtained on the Illumina 450k. First, while 473 of the 500 composition-discriminating CpGs were present on the Illumina 450k, these probes exhibited slightly different behavior in the two arrays (Figure S8 in Additional file 2). Second, 291 CpGs used by Houseman et al.’s algorithm contained an annotated SNPs (by rs number in the dbSNP137 database) at the CpG site of interest (N = 57), at the single base extension site following the CpG (N = 34) or in the probe sequence itself (N = 200). Problems detailing the inclusion of SNPs in the design of the Illumina 450k have been discussed previously [29, 30], and given our data are from a genetically heterogeneous population, we elected to exclude some of these probes.

We therefore obtained flow-sorted data, including the same six cellular components on six adult male subjects on the Illumina 450k platform [17], and derived our own similar blood composition algorithm using linear modeling across 600 composition-discriminating probes. We computed t-statistics for each cell type after removing probes that associated with age (at P ≤ 0.05), comparing that particular cell type with all others, and selected among the CpGs showing differences at P < 10-8 the 100 most differentially methylated probes by effect size, 50 hypermethylated and 50 hypomethylated. One outlying CD8+ T cell was excluded for the sake of composition estimation. The choices of 50 and 10-8 were somewhat arbitrary, but in-sample cross-validation (via leaving out one sample per cell type, training the model on the remaining 30 samples, and then predicting the 6 left out samples) demonstrated nearly perfect concordance between our estimates of cellular composition and the true values (Figure S9 in Additional file 2).

We also validated the overall algorithm using publicly available brain data from Guintivano et al. [18], which consisted of flow-sorted NeuN + and NeuN- cellular populations from the dorsolateral prefrontal cortex as training data, and then mixture data containing 10% NeuN+/90% NeuN-, 20% NeuN+/80% NeuN-, …, 90% NeuN+/10% NeuN- and bulk tissue data with FACS-derived counts of NeuN + cells as testing data. We processed the data (quantile normalization, dropping probes with SNPs and on sex chromosomes), picked 50 hypo- and hyper-methylated probes, and implemented the algorithm. The algorithm successfully recovered the mixture experiment (correlation = 0.9995; Figure S10A in Additional file 2) and predicted the FACS-derived counts from bulk tissue with moderate accuracy (correlation = 0.786). Lastly, we expect similar accuracy in blood (<10% on the Illumina 27k) as Houseman et al. [16], as we have adapted the algorithm to the Illumina 450k without changing the regression calibration approach.

Software to implement the estimation of cellular compositions from cell-sorted DNAm data is available in the minfi Bioconductor package [31]. Publicly available cell-sorted data, to be used in conjunction with the minfi package, are available in the FlowSorted.Blood.450k Bioconductor package.

Previously published solution does not generally adjust for confounding

Guintivano et al. [18] also provide software that implements a method that they claim can transform data to eliminate (or at least reduce) the confounding effect of cell type heterogeneity on methylation profiles. Although the software is developed for brain, and only for two cell types, one could envision extensions applicable to cases with more cell types such as blood.

However, we offer a mathematical proof demonstrating that the solution offered in the paper only adjusts for confounding in a very special case. To understand the transformation we downloaded the accompanying software package CETS (version 0.99.2) and deciphered it from the R code. Here is a mathematical description of what the transformation does.

Let Y i be the observed methylation profile for the ith individual, a mix of glia (G) and neurons (N). Then we can write:

Y i = π i μ i , N + 1 π i μ i , G + ϵ i

where π i is the proportion of the ith sample that comes from neurons, μ i,N and μ i,G are the profiles for neurons and glia, respectively, and ϵ i is measurement error. In their software, Guintivano et al. [18] provide neuron and glia profiles based on an average across many cell-sorted samples, which we will denote with μ ¯ N and μ ¯ G . It is important to note that these are averages and thus different from the individual profiles. The transformation proposed by Guintivano et al. [18] is:

T Y i = Y i + 1 π i μ ¯ N μ ¯ G

They claim that this will recover the pure neuronal signal μ i,N . But we can do some arithmetic to note that the above can be rewritten as:

μ i , N + 1 π i μ ¯ N μ i , N μ ¯ G μ i , G + ϵ i

Thus, the signal is recovered only when the difference between the individual profiles and the average profiles are the same across cell type, which is not a reasonable, nor useful, assumption.

Variability in sorted cell populations

We downloaded publicly available data from Rakyan et al. [6] at GEO accession GSE20242, which consisted of sorted adult blood samples for monocyte and CD4+ T-cell populations. Linear regression models including i) age, ii) cell type, iii) both age, cell type, and their interaction term were fit at every probe. We summarized each fit with the adjusted R2 (coefficient of determination) . We then examined the P-values for the age terms within each cellular population at our 600 probes from the Illumina 450k used to estimate cellular composition that were also present on the Illumina 27k (n = 23).

Analysis of reported age-associated differentially methylated regions

We downloaded tables for statistically significant age-associated differentially methylated probes or regions (DMRs) from the supplementary material of published manuscripts listed in Additional file 4. For each reported age-associated DMR, we identified the F-statistic (and resulting marginal P-value) for that probe for the effect of composition in the publicly available sorted Illumina 450k data [17].

We applied naïve regression adjustment (for example, adjusting for cell type estimates) and two-step RUV using k = 10 (principal components) in Alisch et al. [8] and k = 30 in Hannum et al. [10], which were determined using diagnostic plots across a range of k values. Univariate regression modeling for Hannum et al. [10] included a categorical 'plate' adjustment variable, as plate and age were strongly associated, and plate and DNAm estimates were also associated. The RUV method requires control probes that are affected by the confounder (cell composition) but not the outcome of interest. We therefore used our 600 probes used to estimate cell type proportion since we showed these had no relation to age in at least two cell types. While it is possible that they are age associated in other cell types the results summarized in Figure S2 in Additional file 2 suggest that this is a useful approximation. With these control probes in place we then let the algorithm estimate the surrogate variables.

We assessed functional significance through enrichment using pre-defined gene sets with the Gene Ontology database. First we mapped each CpG to its Entrez Gene ID [24] for background enrichment (311,817/456,655 probes had an annotated Entrez Gene ID). For each gene set with 25 or more genes, we assessed the number of CpGs that mapped to each gene set. Then we assessed the number of reported age DMR CpGs in the existing gene sets, before (n = 4,691/5,237 mapped to an Entrez ID) and after (n = 1,090/1,209 mapped to an Entrez ID) removing probes that correlated with composition (f-statistic P-value <1 × 10-4 and DNAm range >10%). The observed verses expected ratios were computed for every gene set before and after this composition filtering, and are presented in Additional file 5.

Data availability

All datasets are publicly available in the GEO database [27] at the accessions available in Additional file 1.