Trends and Application of Data Science in Bioinformatics

Supriya, P.; Marudamuthu, Balakrishnan; Soam, Sudhir Kumar; Rao, Cherukumalli Srinivasa

doi:10.1007/978-981-33-6815-6_12

Part of the book series: Studies in Computational Intelligence (SCI, volume 954)

Abstract

Advances in sequencing technologies and in omics have generated an extensive volume of biological data in recent years. Analyzing such massive amounts of data and drawing conclusions from them requires sophisticated analytical tools. Bioinformatics is an interdisciplinary science that analyzes and interprets biological data by applying statistics, computational methodologies, and information technology. As huge volumes of genomic, proteomic, and other data are generated, the analysis and interpretation of these data sets involve data science and data mining tools, and researchers increasingly rely on such tools to store and analyze the data. Data science is an interdisciplinary science that uses algorithms and scientific methods to derive information and insights from big data. It draws on a wide variety of subjects, including computer science, mathematics, statistics, databases, machine learning, and optimization. These strategies promote the investigation and advancement of innovative methods to improve the incorporation of big data and data science into biological research. Advancements in computing and data science offer viable analytical techniques for processing huge biological data sets. Consequently, there is considerable scope to enhance the interaction between bioinformatics and data science. Future applications of data science should concentrate on creating high-end integrated technologies that process enormous biological data sets at relatively low cost, with greater efficiency and reliable protection measures, to advance bioinformatics research.

Author information

Authors and Affiliations

ICAR-National Academy of Agricultural Research Management, Hyderabad, 500030, India
P. Supriya, Balakrishnan Marudamuthu, Sudhir Kumar Soam & Cherukumalli Srinivasa Rao

Corresponding author

Correspondence to P. Supriya.

Editor information

Editors and Affiliations

School of Computer Engineering, Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, Odisha, India
Siddharth Swarup Rautaray
Global Delivery Services, Infor India, Hyderabad, Telangana, India
Phani Pemmaraju
School of Computer Engineering, Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, Odisha, India
Hrushikesha Mohanty

About this chapter

Cite this chapter

Supriya, P., Marudamuthu, B., Soam, S.K., Rao, C.S. (2021). Trends and Application of Data Science in Bioinformatics. In: Rautaray, S.S., Pemmaraju, P., Mohanty, H. (eds) Trends of Data Science and Applications. Studies in Computational Intelligence, vol 954. Springer, Singapore. https://doi.org/10.1007/978-981-33-6815-6_12

DOI: https://doi.org/10.1007/978-981-33-6815-6_12
Published: 22 March 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6814-9
Online ISBN: 978-981-33-6815-6
eBook Packages: Engineering, Engineering (R0)
