Abstract
Following the rapid development and adoption of DNA methylation microarray assays, the number of statistical tools for analyzing the resulting large-scale data sets is now growing. As with other microarray applications, biases caused by technical issues are of concern. Some of these issues are old (e.g., two-color dye bias and probe- and array-specific effects), while others are new (e.g., fragment length bias and bisulfite conversion efficiency). Here, I highlight characteristics of DNA methylation that suggest standard statistical tools developed for other data types may not be directly suitable. I then describe the microarray technologies most commonly in use, along with the methods used for preprocessing and obtaining a summary measure. I finish with a section describing downstream analyses of the data, focusing on methods that model percentage DNA methylation as the outcome, and methods for integrating DNA methylation with gene expression or genotype data.
Acknowledgments
I would like to thank Dr. Joe Hacia for his comments on an early draft and Dr. Christina Curtis for discussions regarding methods for data integration. I would also like to thank Tim Triche Jr. for his work on Beta Regression and the preprocessing of DNA methylation data from Illumina’s Infinium platform, and Dr. Peter W. Laird for the many helpful discussions over the years. This work was supported by NCI grant number R01 CA097346 and NIEHS grant number P30 ES07048. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
Cite this article
Siegmund, K.D. Statistical approaches for the analysis of DNA methylation microarray data. Hum Genet 129, 585–595 (2011). https://doi.org/10.1007/s00439-011-0993-x
DOI: https://doi.org/10.1007/s00439-011-0993-x