Abstract
Thanks to Next Generation Sequencing (NGS) techniques, publicly available genomic data on cancer are growing quickly. Indeed, the largest public database of cancer, The Cancer Genome Atlas (TCGA), contains huge amounts of biomedical big data to be analyzed with advanced knowledge extraction methods. In this work, we focus on the NGS experiment of DNA methylation, whose data matrices are composed of hundreds of thousands of features (i.e., methylated sites). We propose an efficient data processing procedure that yields a gene-oriented organization of the data and enables supervised machine learning analysis with state-of-the-art methods. The procedure divides the original data matrices into several sub-matrices, each one containing the sites located within the same gene. We extract DNA methylation data of three tumor types (i.e., breast, prostate, and thyroid carcinomas) from TCGA, and we successfully discriminate tumoral from non-tumoral samples using function-, tree-, and rule-based classifiers. Finally, we select the best performing genes (matrices) by ranking them according to the accuracy of the classifiers, and we perform an enrichment analysis on them. These genes can be further investigated by domain experts to establish their relation to the cancers under study.
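The gene-oriented procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy beta values, the site and gene identifiers, and the site-to-gene mapping are all hypothetical, and a leave-one-out nearest-centroid classifier stands in for the function-, tree-, and rule-based classifiers mentioned in the abstract.

```python
from collections import defaultdict


def split_by_gene(matrix, site_ids, site_to_gene):
    """Split a samples x sites matrix into one sub-matrix per gene."""
    columns = defaultdict(list)
    for j, site in enumerate(site_ids):
        gene = site_to_gene.get(site)
        if gene is not None:  # sites not mapped to any gene are skipped
            columns[gene].append(j)
    return {gene: [[row[j] for j in idx] for row in matrix]
            for gene, idx in columns.items()}


def loo_accuracy(sub_matrix, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in)."""
    correct = 0
    for i, (x, y) in enumerate(zip(sub_matrix, labels)):
        centroids = {}
        for cls in set(labels):
            rows = [r for k, (r, lab) in enumerate(zip(sub_matrix, labels))
                    if k != i and lab == cls]
            if rows:
                centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        # Predict the class whose centroid is closest (squared Euclidean).
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[c])))
        correct += int(pred == y)
    return correct / len(labels)


def rank_genes(matrix, labels, site_ids, site_to_gene):
    """Rank genes by the accuracy obtained on their own sub-matrix."""
    subs = split_by_gene(matrix, site_ids, site_to_gene)
    scores = {gene: loo_accuracy(sub, labels) for gene, sub in subs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): 4 samples x 3 methylated sites, beta values.
site_ids = ["cg1", "cg2", "cg3"]
site_to_gene = {"cg1": "GENE_A", "cg2": "GENE_A", "cg3": "GENE_B"}
matrix = [
    [0.90, 0.80, 0.50],
    [0.85, 0.90, 0.10],
    [0.10, 0.20, 0.45],
    [0.15, 0.10, 0.15],
]
labels = ["tumoral", "tumoral", "normal", "normal"]

ranking = rank_genes(matrix, labels, site_ids, site_to_gene)
print(ranking)
```

In this toy example the sites of GENE_A separate the two classes while the single site of GENE_B does not, so GENE_A ranks first. On real data, each sub-matrix would be passed to the chosen classifiers under a proper validation scheme.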
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Weitschek, E., Cumbo, F., Cappelli, E., Felici, G., Bertolazzi, P. (2018). Classifying Big DNA Methylation Data: A Gene-Oriented Approach. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_11
DOI: https://doi.org/10.1007/978-3-319-99133-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99132-0
Online ISBN: 978-3-319-99133-7
eBook Packages: Computer Science, Computer Science (R0)