Model-Based Clustering of DNA Methylation Array Data

  • Devin C. Koestler
  • E. Andrés Houseman
Part of the Translational Bioinformatics book series (TRBIO, volume 7)


Clustering refers to the “grouping” of observations into a discrete set of classes, such that observations in the same class are more similar to one another than to observations in other classes. In the context of DNA methylation data, clustering can be used to discover novel molecular subtypes or to identify biological pathways composed of co-methylated CpG dinucleotides, depending on whether the samples or the CpGs themselves are being clustered. In this chapter, we focus on the problem of clustering samples/subjects on the basis of their methylation profiles. We begin by discussing the motivation for clustering DNA methylation data, the nature of the DNA methylation data generated by Illumina BeadArrays, and three promising model-based clustering methods. In addition to providing a methodological overview of each of the three methods, we demonstrate their application using a publicly available data set deposited in the Gene Expression Omnibus (GEO) database. We also discuss feature selection and the comparison of clustering partitions.


Model-based clustering; Finite mixture models; DNA methylation; Microarray; Illumina Infinium Methylation BeadArrays



We would like to offer our deepest gratitude to Dr. Joseph Usset and Samuel Turpin for their feedback, suggestions, and comments on this chapter.



Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  1. Department of Biostatistics, University of Kansas Medical Center, Kansas City, USA
  2. Department of Public Health, Oregon State University, Corvallis, USA
