Classifying Big DNA Methylation Data: A Gene-Oriented Approach

  • Conference paper
Database and Expert Systems Applications (DEXA 2018)

Abstract

Thanks to Next Generation Sequencing (NGS) techniques, publicly available genomic data on cancer are growing quickly. Indeed, the largest public cancer database, The Cancer Genome Atlas (TCGA), contains huge amounts of biomedical big data to be analyzed with advanced knowledge extraction methods. In this work, we focus on the NGS experiment of DNA methylation, whose data matrices are composed of hundreds of thousands of features (i.e., methylated sites). We propose an efficient data processing procedure that yields a gene-oriented organization of the data and enables supervised machine learning analysis with state-of-the-art methods. The procedure divides the original data matrices into several sub-matrices, each containing the sites located within the same gene. We extract DNA methylation data of three tumor types (i.e., breast, prostate, and thyroid carcinomas) from TCGA and successfully discriminate tumoral from non-tumoral samples using function-, tree-, and rule-based classifiers. Finally, we select the best-performing genes (matrices) by ranking them according to the accuracy of the classifiers, and we perform an enrichment analysis on them. These genes can be further investigated by domain experts to verify their relation to the cancers under study.
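
The gene-oriented procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the site identifiers, gene names, beta values, and the site-to-gene mapping are invented toy data, and a nearest-centroid classifier scored with leave-one-out accuracy stands in for the function-, tree-, and rule-based classifiers used in the paper. Sites that map to no gene are simply dropped here, which is also an assumption.

```python
from collections import defaultdict

# Toy data (hypothetical): columns are methylated sites, rows are samples.
SITE_IDS = ["cg01", "cg02", "cg03", "cg04"]
SITE_TO_GENE = {"cg01": "GENE_A", "cg02": "GENE_A", "cg03": "GENE_B"}  # cg04 is unmapped
ROWS = [
    [0.90, 0.85, 0.50, 0.30],
    [0.88, 0.80, 0.20, 0.70],
    [0.92, 0.90, 0.55, 0.40],
    [0.10, 0.15, 0.52, 0.35],
    [0.12, 0.20, 0.22, 0.65],
    [0.08, 0.10, 0.50, 0.45],
]
LABELS = ["tumoral", "tumoral", "tumoral", "normal", "normal", "normal"]


def split_by_gene(site_ids, rows, site_to_gene):
    """Divide the sample-by-site matrix into one sub-matrix per gene."""
    cols_by_gene = defaultdict(list)
    for j, site in enumerate(site_ids):
        gene = site_to_gene.get(site)
        if gene is not None:  # assumption: unmapped sites are discarded
            cols_by_gene[gene].append(j)
    return {
        gene: [[row[j] for j in cols] for row in rows]
        for gene, cols in cols_by_gene.items()
    }


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in model)."""
    correct = 0
    for i in range(len(X)):
        by_label = defaultdict(list)
        for k, (x, label) in enumerate(zip(X, y)):
            if k != i:  # hold out sample i
                by_label[label].append(x)
        centroids = {label: centroid(vs) for label, vs in by_label.items()}
        predicted = min(centroids, key=lambda label: sq_dist(X[i], centroids[label]))
        correct += predicted == y[i]
    return correct / len(X)


def rank_genes(site_ids, rows, labels, site_to_gene):
    """Score each gene sub-matrix and sort genes by decreasing accuracy."""
    sub_matrices = split_by_gene(site_ids, rows, site_to_gene)
    scores = {gene: loo_accuracy(X, labels) for gene, X in sub_matrices.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for gene, accuracy in rank_genes(SITE_IDS, ROWS, LABELS, SITE_TO_GENE):
        print(gene, round(accuracy, 2))
```

Because each gene is scored independently, the per-gene classifications could be run in parallel on real data, where the number of sub-matrices is on the order of the number of annotated genes.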


Author information

Corresponding author

Correspondence to Emanuel Weitschek.

Copyright information

© 2018 Springer Nature Switzerland AG

Cite this paper

Weitschek, E., Cumbo, F., Cappelli, E., Felici, G., Bertolazzi, P. (2018). Classifying Big DNA Methylation Data: A Gene-Oriented Approach. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_11

  • DOI: https://doi.org/10.1007/978-3-319-99133-7_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99132-0

  • Online ISBN: 978-3-319-99133-7

  • eBook Packages: Computer Science, Computer Science (R0)
