Trends and Application of Data Science in Bioinformatics

Trends of Data Science and Applications

Part of the book series: Studies in Computational Intelligence (SCI, volume 954)

Abstract

Advances in sequencing technologies and rapid progress in omics have generated an extensive volume of biological data in recent years. Analyzing such massive amounts of data and drawing conclusions from them requires sophisticated analytical tools. Bioinformatics is an interdisciplinary science that analyzes and interprets biological data by applying statistics, computational methodologies, and information technology. As huge volumes of genomic, proteomic, and other data are generated, analyzing and interpreting these biological data sets involves data science and data mining tools, and researchers must rely increasingly on such tools to store and analyze the data. Data science is an interdisciplinary science that uses algorithms and scientific methods to derive information and insights from big data. It draws on a wide variety of subjects, including computer science, mathematics, statistics, databases, machine learning, and optimization. These strategies promote the investigation and development of innovative methods for incorporating big data and data science into biological research. Advances in computing and data science offer viable analytical techniques for processing huge biological data sets. Consequently, there is considerable scope to strengthen the interaction between bioinformatics and data science. Future applications of data science should concentrate on creating high-end integrated technologies that process enormous biological data sets at relatively low cost, with greater efficiency and reliable protection measures, to advance bioinformatics research.

Author information

Corresponding author

Correspondence to P. Supriya.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Supriya, P., Marudamuthu, B., Soam, S.K., Rao, C.S. (2021). Trends and Application of Data Science in Bioinformatics. In: Rautaray, S.S., Pemmaraju, P., Mohanty, H. (eds) Trends of Data Science and Applications. Studies in Computational Intelligence, vol 954. Springer, Singapore. https://doi.org/10.1007/978-981-33-6815-6_12

  • DOI: https://doi.org/10.1007/978-981-33-6815-6_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6814-9

  • Online ISBN: 978-981-33-6815-6

  • eBook Packages: Engineering, Engineering (R0)
