Big Data Analytics and Its Prospects in Computational Proteomics

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 340)


The volume and variety of biological data are increasing at an exponential rate. New proteins are sequenced and novel structures are discovered every week. With the emergence of previously unknown diseases, it has become imperative that vaccines and drugs be designed as quickly as possible. The result is an immense surge of information that is becoming increasingly difficult to process with limited computational resources. The need of the hour is therefore to harness technologies such as Big Data, which help distribute computations over a group of nodes and speed up data analysis. In this paper we explore some techniques for distributing the job of data analysis across several computers that work in parallel and reach a solution faster.


Keywords: Big data; Computational proteomics; Hadoop; MapReduce; Parallel implementation



Copyright information

© Springer India 2015

Authors and Affiliations

  1. Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
