Abstract
Clustering refers to the “grouping” of observations into a discrete set of classes, such that observations in the same class are more similar to one another than to observations in different classes. In the context of DNA methylation data, clustering can be used to discover novel molecular subtypes or to identify biological pathways composed of co-methylated CpG dinucleotides, depending on whether the samples or the CpGs themselves are being clustered. In this chapter, we focus on the problem of clustering samples/subjects on the basis of their methylation profiles. We begin by discussing the motivation behind clustering DNA methylation data, the nature of DNA methylation data generated from the Illumina BeadArrays, and three promising model-based clustering methods. In addition to providing a methodological overview of each of the three methods, we demonstrate their application using a publicly available data set deposited in the Gene Expression Omnibus (GEO) database. Issues such as feature selection and the comparison of clustering partitions are also discussed.
Notes
1. Details regarding the specification of the computing resources used for estimating computational times can be found at http://www.acf.ku.edu/wiki/.
Acknowledgements
We would like to offer our deepest gratitude to Dr. Joseph Usset and Samuel Turpin for their feedback, suggestions, and comments on this chapter.
Copyright information
© 2015 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Koestler, D.C., Houseman, E.A. (2015). Model-Based Clustering of DNA Methylation Array Data. In: Teschendorff, A. (eds) Computational and Statistical Epigenomics. Translational Bioinformatics, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-9927-0_5
DOI: https://doi.org/10.1007/978-94-017-9927-0_5
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-017-9926-3
Online ISBN: 978-94-017-9927-0
eBook Packages: Biomedical and Life Sciences (R0)