Adjusting for Cell Type Composition in DNA Methylation Data Using a Regression-Based Approach

  • Meaghan J. Jones
  • Sumaiya A. Islam
  • Rachel D. Edgar
  • Michael S. Kobor
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 1589)

Abstract

Analysis of DNA methylation in a population context has the potential to uncover novel gene and environment interactions as well as markers of health and disease. In order to find such associations it is important to control for factors which may mask or alter DNA methylation signatures. Since tissue of origin and coinciding cell type composition are major contributors to DNA methylation patterns, and can easily confound important findings, it is vital to adjust DNA methylation data for such differences across individuals. Here we describe the use of a regression method to adjust for cell type composition in DNA methylation data. We specifically discuss what information is required to adjust for cell type composition and then provide detailed instructions on how to perform cell type adjustment on high dimensional DNA methylation data. This method has been applied mainly to Illumina 450K data, but can also be adapted to pyrosequencing or genome-wide bisulfite sequencing data.

Keywords:

DNA methylation; Illumina Infinium HumanMethylation450 BeadChip; Cell type; Statistical adjustment; R statistical software

1 Introduction

The number of DNA methylation studies in human populations has been steadily and rapidly rising and this trend will likely continue. With this increase, there is growing appreciation for stringency in analysis and a more complete understanding of important factors to consider when analyzing DNA methylation data. One factor which is now understood to be important is adjusting for interindividual differences in cell type composition of the tissue being interrogated [1, 2, 3, 4, 5].

Since, within a tissue, cell type is the single most important known factor in determining DNA methylation profiles, it follows that interindividual differences in cell type composition might significantly confound results from DNA methylation analyses [1, 2, 3, 5]. For example, if the phenotype of interest is associated with a change in cell composition in the tissue being examined, not adjusting for these differences could result in identification of cell type-specific regions as being associated with the phenotype. This issue has been specifically described in studies of rheumatoid arthritis, age, and current socioeconomic status, where in all three cases, differences in cell type composition of white blood cells between individuals were confounded with the phenotype of interest [2, 3, 4]. This would have led to many potential false positives had the researchers not accounted for these differences. Even if the phenotype of interest is not confounded by cell type differences, the large amount of variability due to these differences can alter or mask true associations; therefore adjustment of DNA methylation data for interindividual differences in cell type composition should still be performed. It is worth noting that such adjustments should be performed within a single tissue (e.g., brain samples) and not across different tissue types (e.g., blood versus brain samples).

Many studies have highlighted the need for cell type composition adjustment in tissues composed of multiple cell types [2, 3, 4, 6, 7]. However, not all tissues have received the same scrutiny and not all available methods are appropriate for every tissue. For example, the most commonly used surrogate tissues are buccal epithelial cells (BEC) and blood. Both of these tissues are composed of multiple cell types; blood contains a multitude of white blood cells, while BECs can have some contaminating level of blood or other tissues [7]. However, while most recent DNA methylation studies using blood include adjustment for cell type composition differences, adjustment of BECs has been attempted in only a few cases [7, 8]. It is also important to acknowledge that in a mixed cell population, a change in DNA methylation that is restricted to an underrepresented cell subtype may not be detected, regardless of adjustment. These changes may be highly interesting, but can only be detected if the tissue is fractionated to separate the cell types.

Here, we describe a regression method to adjust for cell type composition in DNA methylation data. This method is appropriate for use in cases where cell type counts are available, often through direct measurement. In the absence of direct cell counts, various cell type composition prediction methods have been established [6, 9, 10]. Current prediction methods are focused on blood and brain, but in the future other tissues may receive the same scrutiny. Although the computational details for each of these prediction algorithms are beyond the scope of our discussion, it is worth noting that these methods primarily utilize differential DNA methylation signatures of each constituent cell type as references to generate projections of cell type proportions in a given tissue. In this chapter, we will first outline how to decide whether this method is applicable to the specific study and then describe the procedure to follow to adjust the data for differences in cell type between individuals. This regression method is a robust approach to adjust for such differences in cell type composition and accordingly represents an appreciable contribution to the increasing rigor of DNA methylation analyses.

2 Materials

Several pieces of information are required to control for cellular composition of a tissue sample using the regression method:

2.1 Cell Counts

Cell counts can come from a variety of sources, which may vary depending on the tissue (see Note 1). For histological samples, approximate counts from microscopy may be appropriate. For blood, cell counts are often generated by lab-derived Complete Blood Count (CBC, see Note 2) with differential reports or by fluorescence-activated cell sorting (FACS) analysis (see Note 3).

If a cell count from the tissue is unavailable, published methods exist for predicting the underlying cellular composition based on the DNA methylation profile of the tissue for blood and brain [6, 9, 10, 11]. These methods have been used extensively and have proven to be highly reliable in many cases (see Note 4). In the script below, cell counts are contained in an object called “diff”, which is a matrix of cell counts with samples as rows, and cell types as columns (see Note 1).

2.2 DNA Methylation Profiles

Described here is the method commonly used for adjustment of DNA methylation data generated by the Illumina 450K array (see Note 5). For 450K analysis, we recommend that cellular composition adjustment be done after initial preprocessing and quality control checks including probe filtering, normalization, and batch correction according to the pipeline of choice. The input into the script below is an object called “betas” (see Note 6), which is a matrix of beta values with CpG probes as rows and samples as columns in the same order as the rows in the diff matrix (see Note 7).
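As a minimal sketch of the expected inputs, the following R code builds a small simulated “betas” matrix and “diff” table and checks that they line up. The sample names, probe names, and values are invented for illustration; only the object names and the six blood cell type column names come from this chapter. Here “diff” is stored as a data frame, on the assumption that it will be passed to the data argument of lm, which does not accept a plain matrix.

```r
set.seed(1)
samples <- paste0("S", 1:8)                      # hypothetical sample IDs
cell_types <- c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran")

# Hypothetical cell proportions (percent): samples as rows, cell types as columns.
raw <- matrix(runif(length(samples) * length(cell_types)),
              nrow = length(samples),
              dimnames = list(samples, cell_types))
diff <- as.data.frame(100 * raw / rowSums(raw))

# Hypothetical beta values: CpG probes as rows, samples as columns.
betas <- matrix(runif(5 * length(samples), min = 0.05, max = 0.95),
                nrow = 5,
                dimnames = list(paste0("cg", 1:5), samples))

# Reorder the cell counts to match the sample order of the beta matrix.
diff <- diff[colnames(betas), , drop = FALSE]

# Sanity checks: same samples in the same order, no missing cell counts,
# and beta values inside the interval from 0 to 1.
inputs_ok <- identical(rownames(diff), colnames(betas)) &&
  !any(is.na(diff)) &&
  all(betas >= 0 & betas <= 1, na.rm = TRUE)
inputs_ok
```

Running a check of this kind before fitting any models catches the most common failure, which is a silent mismatch between the sample order of the methylation data and that of the cell counts.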

2.3 Statistical Software

R statistical software, with R-specific scripts, is commonly used for this method (see Note 8) [12].

3 Methods

First, a decision must be made regarding whether the regression method can be applied to the specific project in question, or whether reference-free methods must be used (see Note 9). A flow chart is laid out in Fig. 1 illustrating the best choices for particular projects. Importantly, cell type adjustment should only be used if there is reason to expect that the tissue being assessed contains a mixture of cell types which might differ across individuals.
Fig. 1

Flow chart describing decision tree to determine whether cell type adjustment using the regression/residual method is appropriate, or whether a reference-free method should be used (methods in purple). The regression method outlined in this chapter can be used in any case where the cellular composition of the sample is known or can be predicted (with blood or brain prediction algorithms, shown in yellow)

Once the appropriate information has been gathered, as outlined in Section 2, you can proceed to adjust DNA methylation data for cell composition differences using the regression method. The process to adjust the data is as follows, with the appropriate R code and annotation.
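  1.

    # Match the cell counts to the sample order of betas; lm needs a data frame
    blood<-as.data.frame(diff[colnames(betas),])

    # Fit one linear model per probe (row of betas)
    beta.lm<-apply(betas, 1, function(x){lm(x~CD8T+CD4T+NK+Bcell+Mono+Gran,data=blood)})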

    First, for each probe in the beta matrix, fit a linear model on the DNA methylation measures using cell type proportions as additive variables (here illustrated using blood cell types, see Notes 2, 10, and 11). This estimates the degree of DNA methylation variability that is predicted by the underlying cell type composition for each probe.

     
  2.

    # One row of residuals per probe, one column per sample
    residuals<-t(sapply(beta.lm,residuals))

    colnames(residuals)<-colnames(betas)

    Next, extract a matrix of residuals from the resulting linear models. For each probe, the residuals are calculated as the difference between the observed methylation values and the methylation values predicted by the fitted linear model. These residuals represent the remaining DNA methylation variability that is unexplained by cell type composition and may accordingly be explained by other phenotypic factors of interest. Since a linear model is fit to each probe individually, probes with methylation levels that are less affected by cell type composition will be modified to a lesser extent than probes that are highly associated with cell type composition.

     
  3.

    adj.betas<-residuals+matrix(apply(betas, 1, mean), nrow=nrow(residuals), ncol=ncol(residuals))

    Next, add the residuals of each regression model to the mean methylation value of each probe (mean across all samples) to obtain the “adjusted” methylation data.

     
  4.

    adj.m<-beta2m(adj.betas)

    Finally, and optionally, perform a logit transformation to convert the beta values back to M values for downstream statistical analysis using the beta2m function in the lumi R package (see Note 12) [13].

     
  5.
    (Optional) Perform Principal Component Analysis (PCA) on the original beta value matrix and the adjusted beta values to determine whether any variation associated with cell type has been removed from the data (see Note 13). An example of how this should appear is shown in Fig. 2.
    Fig. 2

    Illustration of ideal results after using PCA to assess the effect of adjusting DNA methylation data for cell type composition. Heatmap indicates p values of correlations between the top 20 Principal Components (PCs) and variables. Prior to cell type adjustment (left), many of the top PCs are associated with cell types, some of which are confounded with sex and age (serving as example test variables). After cell type adjustment (right) the associations with cell type are no longer observed, but the signal from the two test variables is still clear
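Steps 1 to 4 can be sketched end to end on simulated data as follows. This is an illustration under stated assumptions rather than the authors' exact script: the data are invented, the probe names cg_celltype and cg_other are hypothetical, the handling of missing and out-of-range values follows our reading of Notes 7 and 12, and the M value conversion is written out as log2(beta/(1 − beta)) instead of calling beta2m so that the example needs no packages beyond base R.

```r
set.seed(42)
n <- 30
cell_types <- c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran")
samples <- paste0("S", seq_len(n))

# Simulated cell proportions (percent). Probe 1 tracks granulocyte
# proportion; probe 2 is unrelated to cell type.
raw <- matrix(runif(n * 6), nrow = n, dimnames = list(samples, cell_types))
diff <- as.data.frame(100 * raw / rowSums(raw))
betas <- rbind(
  cg_celltype = 0.2 + 0.01 * diff$Gran + rnorm(n, sd = 0.02),
  cg_other    = 0.5 + rnorm(n, sd = 0.02)
)
colnames(betas) <- samples
betas["cg_other", 3] <- NA  # one missing value to illustrate Note 7

# Note 7: remember NA positions and impute with the probe median.
na_pos <- is.na(betas)
probe_medians <- apply(betas, 1, median, na.rm = TRUE)
betas_imp <- betas
betas_imp[na_pos] <- probe_medians[row(betas)[na_pos]]

# Steps 1 and 2: per-probe linear model on cell types, then residuals.
blood <- diff[colnames(betas_imp), ]
beta.lm <- apply(betas_imp, 1, function(x) {
  lm(x ~ CD8T + CD4T + NK + Bcell + Mono + Gran, data = blood)
})
resid.mat <- t(sapply(beta.lm, residuals))
colnames(resid.mat) <- colnames(betas_imp)

# Note 10: the proportions sum to 100, so lm reports one aliased
# coefficient as NA; the residuals are unaffected.
has_aliased_term <- any(is.na(coef(beta.lm[[1]])))

# Step 3: add each probe's mean back to its residuals, then restore NAs.
adj.betas <- resid.mat + rowMeans(betas_imp)
adj.betas[na_pos] <- NA

# Note 12: pull values outside (0, 1) back inside before the logit.
in_range <- adj.betas[!is.na(adj.betas) & adj.betas > 0 & adj.betas < 1]
adj.betas[!is.na(adj.betas) & adj.betas >= 1] <- max(in_range)
adj.betas[!is.na(adj.betas) & adj.betas <= 0] <- min(in_range)

# Step 4: M values as log2(beta / (1 - beta)).
adj.m <- log2(adj.betas / (1 - adj.betas))

# Correlation with granulocyte proportion before and after adjustment.
cor_before <- cor(betas["cg_celltype", ], diff$Gran)
cor_after  <- cor(adj.betas["cg_celltype", ], diff$Gran)
```

In this simulation the cell type-driven probe is strongly correlated with granulocyte proportion before adjustment and essentially uncorrelated afterwards, because least-squares residuals are orthogonal to the model terms.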

     

4 Notes

  1.

    Cell counts can be represented as either proportions (in percent) or absolute cell counts, but these should be treated differently when fitting the linear models in Step 1 of the regression method, as described in Note 10.

     
  2.

    Whole blood samples are often fractionated to remove granulocytes, resulting in an enriched population of mononuclear cells, prior to DNA methylation analysis. It should be noted that a lab-generated CBC differential report from whole blood may not accurately reflect the cellular composition of mononuclear cells isolated from the same sample. Thus, for mononuclear cells, if post-mononuclear cell enrichment counts are not available, it is generally recommended that blood cell type prediction methods be used to generate accurate cell counts [9, 11, 14].

     
  3.

    FACS-derived counts are often highly accurate representations of the actual cellular composition of a sample; however, particular care must be taken with the staining and isolation of cells to ensure that artifacts are not introduced. For example, increased mortality of a single cell type during preparation could skew the results, causing the true proportion of those cells to be underestimated.

     
  4.

    The commonly used deconvolution methods for DNA methylation are available for brain and blood, the former found in the CETS R package and the latter in the minfi R package [6, 9, 14]. Both these packages output a matrix of cell counts for each sample. A new method for deconvolution for brain is also available [10]. While highly reliable for samples from adults, it is possible that the blood deconvolution in particular is less accurate for pediatric samples. There is also a possibility that ethnicity, environment, or health status may affect the accuracy of these predictions, if these factors greatly affect reference methylation profiles at sites used for the prediction. Thus, care should be taken when applying these methods to samples to ensure that confounding factors are not affecting the quality of cell type prediction. If cell counts are available for a subset of samples, or for similar samples, cross-validating the prediction with the known cell composition is an important check.

     
  5.

    In addition to Illumina 450K array data, this same procedure should be applicable to pyrosequencing, Reduced Representation Bisulfite Sequencing (RRBS) or Whole Genome Bisulfite Sequencing (WGBS) data.

     
  6.

    Beta values are between 0 and 1, where 0 represents 0 % DNA methylation and 1 represents 100 % methylation. Due to heteroscedasticity of beta values, M values are often used for statistical analysis [15]. For cell type adjustment, beta values are the appropriate measure, but they should be converted back to M values prior to downstream analysis.

     
  7.

    Missing values (NAs) in the beta value matrix must be imputed prior to fitting the linear models. Any missing values can be replaced by the probe median value before Step 1, and the NAs should be restored at their original positions after Step 3.

     
  8.

    Although we have described the method using R statistical software and scripts, the regression method should be adaptable to any software package.

     
  9.

    If no cell count is available or predictable for the tissue in question (i.e., tissue is not blood or brain) and there is reasonable expectation that cell type composition would differ between individuals, reference-free methods or surrogate variable methods may be the best choice [16, 17, 18]. However, these methods have some limitations. They have been specifically designed for array analysis and may not be transferable to RRBS or WGBS data. These reference-free methods in particular are also designed as full analysis packages, with little control or oversight into the intervening steps.

     
  10.

    If absolute cell counts are used in the regression method, all counted cell types should be included in the model. However, if percent proportions are used, where all the cell types counted add to 100 %, the cell type columns are perfectly collinear with the model intercept, so one of them should be left out of the model. Note that the lm function in R handles this case by default: when the proportions sum to exactly 100 % in every sample, it drops one of the collinear terms and reports its coefficient as NA, leaving the residuals unaffected (see Step 1 in Section 3).

     
  11.

    It is important to note that the regression method adjusts the data independently of other covariates, resulting in a matrix of beta or M values in which the effects of cell composition have been removed. This is slightly different from another common method, which is to add the cell type variables as covariates in the linear models used in the analysis itself [4]. We feel that the regression method is superior in most cases because downstream analyses are not required to include cell type variables each time. This is helpful for analyses using methods where incorporation of extra variables is difficult, such as hierarchical clustering.

     
  12.

    After cell type adjustment, some beta values may fall above 1 or below 0. It is important to change these values before converting to M values, because the logit transformation is undefined outside the interval from 0 to 1 and returns infinity or −infinity at exactly 1 or 0. Our procedure is to replace any values higher than 1 with the highest value that is less than 1, and similarly to replace any values lower than 0 with the lowest value that is greater than 0.

     
  13.

    In the specific case where cell type composition is confounded with a variable of interest, it should be apparent in the PCA analysis. This represents a potentially highly interesting aspect of the phenotype, but does complicate DNA methylation studies. Adjusting the data for cell type composition is extremely important in a case such as this, but it is important to be aware that the adjustment may remove some of the DNA methylation signal associated with the phenotype. In order to find pure signals associated with the phenotype, purification of a single cell type may be required.
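The PCA check described in Step 5 and Note 13 can be sketched as follows, again on simulated data. All names here (gran, age, make_probes, pc_pvalues) are hypothetical, a single cell type is used for brevity, and the p values are taken from Pearson correlation tests between each of the top principal components and a variable, which is one reasonable way to build the kind of summary shown in Fig. 2 but is an assumption about how that figure was produced.

```r
set.seed(7)
n <- 40
samples <- paste0("S", seq_len(n))
gran <- runif(n, 40, 70)   # hypothetical granulocyte percent
age  <- runif(n, 20, 60)   # hypothetical test variable

# Simulate k probes that depend linearly on a driver variable plus noise.
make_probes <- function(k, driver, slope) {
  t(replicate(k, 0.5 + slope * (driver - mean(driver)) + rnorm(n, sd = 0.01)))
}

# 50 probes driven by cell type, 50 by age, 100 pure noise.
betas <- rbind(make_probes(50, gran, 0.005),
               make_probes(50, age, 0.003),
               make_probes(100, rep(0, n), 0))
colnames(betas) <- samples

# Regression adjustment for the single measured cell type.
resid.mat <- t(apply(betas, 1, function(x) residuals(lm(x ~ gran))))
adj.betas <- resid.mat + rowMeans(betas)
colnames(adj.betas) <- colnames(betas)

# p value of the correlation between each of the top PCs and a variable.
pc_pvalues <- function(mat, variable, n_pcs = 5) {
  pcs <- prcomp(t(mat), center = TRUE, scale. = FALSE)$x[, seq_len(n_pcs)]
  apply(pcs, 2, function(pc) cor.test(pc, variable)$p.value)
}

p_gran_before <- pc_pvalues(betas, gran)      # cell type signal present
p_gran_after  <- pc_pvalues(adj.betas, gran)  # cell type signal removed
p_age_after   <- pc_pvalues(adj.betas, age)   # test variable signal retained
```

In this simulation at least one top principal component is strongly associated with cell type before adjustment and none is afterwards, while the association with the test variable remains. If a variable of interest were confounded with cell type, part of its signal would be removed as well, which is the caveat raised in Note 13.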

     

References

  1. Reinius LE, Acevedo N, Joerink M et al (2012) Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One 7:e41361
  2. Jaffe AE, Irizarry RA (2014) Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol 15:R31
  3. Lam LL, Emberly E, Fraser HB et al (2012) Factors underlying variable DNA methylation in a human community cohort. Proc Natl Acad Sci U S A 109(Suppl 2):17253–17260
  4. Liu Y, Aryee MJ, Padyukov L et al (2013) Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol 31:142–147
  5. Lowe R, Rakyan VK (2014) Correcting for cell-type composition bias in epigenome-wide association studies. Genome Med 6:23
  6. Guintivano J, Aryee MJ, Kaminsky ZA (2013) A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. Epigenetics 8:290–302
  7. Jones MJ, Farré P, McEwen LM et al (2013) Distinct DNA methylation patterns of cognitive impairment and trisomy 21 in down syndrome. BMC Med Genomics 6:58
  8. Smith AK, Kilaru V, Klengel T et al (2014) DNA extracted from saliva for methylation studies of psychiatric traits: evidence tissue specificity and relatedness to brain. Am J Med Genet 168:36–44
  9. Houseman EA, Accomando WP, Koestler DC et al (2012) DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinform 13:86
  10. Montaño CM, Irizarry RA, Kaufmann WE et al (2013) Measuring cell-type specific differential methylation in human brain tissue. Genome Biol 14:R94
  11. Koestler DC, Christensen B, Karagas MR et al (2013) Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics 8:816–826
  12. R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
  13. Du P, Kibbe WA, Lin SM (2008) lumi: a pipeline for processing Illumina microarray. Bioinformatics 24:1547–1548
  14. Aryee MJ, Jaffe AE, Corrada-Bravo H et al (2014) Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30:1363–1369
  15. Du P, Zhang X, Huang C-C et al (2010) Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinform 11:587
  16. Zou J, Lippert C, Heckerman D et al (2014) Epigenome-wide association studies without the need for cell-type composition. Nat Methods 11:309–311
  17. Houseman EA, Molitor J, Marsit CJ (2014) Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics 30:1431–1439
  18. Leek JT, Johnson WE, Parker HS et al (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28:882–883

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Meaghan J. Jones (1)
  • Sumaiya A. Islam (1)
  • Rachel D. Edgar (1)
  • Michael S. Kobor (1)
  1. Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, University of British Columbia, Vancouver, Canada
