Perspectives of Machine Learning Techniques in Big Data Mining of Cancer

Abstract

Advances in cancer genomics and the emergence of personalized medicine heighten the need to decode the genetic information obtained from various high-throughput techniques. Analyzing and interpreting the immense volume of data produced from clinical samples is highly complicated and remains a great challenge. Future discoveries in cancer medicine will depend largely on our ability to process and analyze large genomic data sets and to relate cancer genome profiles to rational, personalized cancer therapeutics. Integrative big data mining approaches are therefore needed to handle this large-scale genomic data, to deal with the highly complex somatic genomic alterations in cancer genomes, and to determine the etiology of a disease in order to identify drug targets. This in turn demands the development of robust methods for interrogating the function of the various genes identified by different genomics efforts. Such methods may also help in understanding the current trends and strategies of fast-evolving cancer genomics research. In recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. This chapter addresses the perspectives of machine learning algorithms in cancer genomics and gives an overview of state-of-the-art techniques in this field.

Author information

Correspondence to Subashini Swaminathan.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Prabahar, A., Swaminathan, S. (2016). Perspectives of Machine Learning Techniques in Big Data Mining of Cancer. In: Wong, KC. (eds) Big Data Analytics in Genomics. Springer, Cham. https://doi.org/10.1007/978-3-319-41279-5_9

  • DOI: https://doi.org/10.1007/978-3-319-41279-5_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41278-8

  • Online ISBN: 978-3-319-41279-5

  • eBook Packages: Computer Science (R0)
