Abstract
The advent of genome and protein sequencing aroused interest in bioinformatics and created the need for databases of biological sequences. These data are processed into useful knowledge by data mining before being stored in databases. This chapter presents a detailed overview of the different types of databases, namely primary, secondary and composite databases, along with many specialized biological databases for RNA molecules, protein-protein interactions, genome information, metabolic pathways, phylogenetic information, etc. It also highlights the drawbacks of present biological databases. Moreover, the chapter provides an elaborate and illustrative discussion of various bioinformatics tools used for gene prediction, sequence analysis, phylogenetic analysis, protein structure and function prediction, and prediction of molecular interactions. These tools serve several purposes, including the discovery of new genes and of conserved regions in protein families, estimation of evolutionary relationships among organisms, 3D structure prediction of drug targets for exploring mechanisms and discovering new drugs, and analysis of protein-protein interactions for exploring signaling pathways.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Kumari, A., Kanchan, S., Sinha, R.P., Kesheri, M. (2016). Applications of Bio-molecular Databases in Bioinformatics. In: Dey, N., Bhateja, V., Hassanien, A. (eds) Medical Imaging in Clinical Applications. Studies in Computational Intelligence, vol 651. Springer, Cham. https://doi.org/10.1007/978-3-319-33793-7_15
DOI: https://doi.org/10.1007/978-3-319-33793-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-33791-3
Online ISBN: 978-3-319-33793-7
eBook Packages: Engineering, Engineering (R0)