Classifying Big DNA Methylation Data: A Gene-Oriented Approach

  • Emanuel Weitschek
  • Fabio Cumbo
  • Eleonora Cappelli
  • Giovanni Felici
  • Paola Bertolazzi
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 903)


Thanks to Next Generation Sequencing (NGS) techniques, publicly available genomic data on cancer is growing quickly. Indeed, The Cancer Genome Atlas (TCGA), the largest public database of cancer, contains huge amounts of biomedical big data to be analyzed with advanced knowledge extraction methods. In this work, we focus on the NGS experiment of DNA methylation, whose data matrices are composed of hundreds of thousands of features (i.e., methylated sites). We propose an efficient data processing procedure that yields a gene-oriented organization of the data and enables supervised machine learning analysis with state-of-the-art methods. The procedure divides the original data matrices into several sub-matrices, each one containing the sites located within the same gene. We extract DNA methylation data of three tumor types (i.e., breast, prostate, and thyroid carcinomas) from TCGA and successfully discriminate tumoral from non-tumoral samples using function-, tree-, and rule-based classifiers. Finally, we select the best performing genes (matrices) by ranking them according to the accuracy of the classifiers, and we perform an enrichment analysis on them. These genes can be further investigated by domain experts to assess their relation to the cancers under study.
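
The gene-oriented procedure summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the site identifiers, gene names, beta values, and the site-to-gene annotation are invented toy data, and a leave-one-out nearest-centroid classifier stands in for the function-, tree-, and rule-based classifiers used in the paper. The sketch only shows the two steps named in the abstract: splitting the matrix into one sub-matrix per gene, and ranking genes by classification accuracy.

```python
from collections import defaultdict


def split_by_gene(matrix, site_ids, site_to_gene):
    """Split a samples x sites matrix into one sub-matrix per gene.

    matrix: list of rows, one row of methylation values per sample.
    site_ids: column names (methylated sites), aligned with each row.
    site_to_gene: hypothetical annotation mapping a site to a gene symbol.
    """
    columns = defaultdict(list)
    for col, site in enumerate(site_ids):
        gene = site_to_gene.get(site)
        if gene is not None:  # sites without a gene annotation are skipped
            columns[gene].append(col)
    return {gene: [[row[c] for c in cols] for row in matrix]
            for gene, cols in columns.items()}


def centroid(rows):
    """Column-wise mean of a list of equally long rows."""
    return [sum(vals) / len(vals) for vals in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(sub_matrix, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    This simple classifier is an assumption of the sketch, chosen only
    to keep the example dependency-free.
    """
    correct = 0
    for i, (row, label) in enumerate(zip(sub_matrix, labels)):
        by_class = defaultdict(list)
        for j, (other, other_label) in enumerate(zip(sub_matrix, labels)):
            if j != i:  # hold out sample i
                by_class[other_label].append(other)
        predicted = min(by_class,
                        key=lambda c: sq_dist(row, centroid(by_class[c])))
        correct += predicted == label
    return correct / len(labels)


def rank_genes(matrix, site_ids, site_to_gene, labels):
    """Return (gene, accuracy) pairs sorted from best to worst accuracy."""
    subs = split_by_gene(matrix, site_ids, site_to_gene)
    scores = {gene: loo_accuracy(sub, labels) for gene, sub in subs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: 3 tumoral and 3 normal samples, 4 sites (all values invented).
site_ids = ["siteA", "siteB", "siteC", "siteD"]
site_to_gene = {"siteA": "GENE1", "siteB": "GENE1", "siteC": "GENE2"}
labels = ["tumoral"] * 3 + ["normal"] * 3
matrix = [
    [0.90, 0.80, 0.3, 0.1],
    [0.85, 0.90, 0.7, 0.3],
    [0.95, 0.85, 0.5, 0.2],
    [0.10, 0.20, 0.7, 0.4],
    [0.15, 0.10, 0.3, 0.1],
    [0.20, 0.15, 0.5, 0.3],
]

sub_matrices = split_by_gene(matrix, site_ids, site_to_gene)
ranked = rank_genes(matrix, site_ids, site_to_gene, labels)
for gene, accuracy in ranked:
    print(gene, round(accuracy, 2))
```

In this toy setting the sites of GENE1 separate the two classes cleanly, so GENE1 ranks above GENE2. Working on one small sub-matrix per gene is what keeps each classification task tractable even when the full matrix has hundreds of thousands of columns.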


Keywords: Classification, DNA methylation, Cancer



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Engineering, Uninettuno University, Rome, Italy
  2. Department of Engineering, Roma Tre University, Rome, Italy
  3. Institute for Systems Analysis and Computer Science, National Research Council, Rome, Italy
