Protein Sequence Databases

  • Michele Magrane
  • Maria Jesus Martin
  • Claire O’Donovan
  • Rolf Apweiler

Abstract

With the availability of almost 150 completed genome sequences from both eukaryotic and prokaryotic organisms, efforts are now being focused on the identification and functional analysis of the proteins encoded by these genomes. The rapidly emerging field of proteomics, the large-scale analysis of these proteins, has started to generate huge amounts of data as a result of the new information provided by the genome projects and by a range of new technologies in protein science. For example, mass spectrometry approaches are being used in protein identification and in determining the nature of posttranslational modifications (1), and large-scale yeast two-hybrid screens provide valuable data about protein-protein interactions (2). These and other methods now make it possible to quickly identify large numbers of proteins in a complex, to map their interactions in a cellular context, to determine their location within the cell, and to analyze their biological activities. Protein sequence databases play a vital role as a central resource for storing the data generated by these efforts and making them freely available to the scientific community. Data from large-scale experiments are often no longer published in a conventional sense but are deposited in a database. This means that protein sequence databases are the most comprehensive resource of information on proteins available to scientists.

References

  1. Sickmann, A., Mreyen, M., and Meyer, H. E. (2003) Mass spectrometry - a key technology in proteome research. Adv. Biochem. Eng. Biotechnol. 83, 141–76.
  2. Coates, P. J. and Hall, P. A. (2003) The yeast two-hybrid system for identifying protein-protein interactions. J. Pathol. 199, 4–7.
  3. Wheeler, D. L., Church, D. M., Federhen, S., et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res. 31, 28–33.
  4. Miyazaki, S., Sugawara, H., Gojobori, T., and Tateno, Y. (2003) DNA Data Bank of Japan in XML. Nucleic Acids Res. 31, 13–16.
  5. Stoesser, G., Baker, W., van den Broek, A., et al. (2003) The EMBL Nucleotide Sequence Database: major new developments. Nucleic Acids Res. 31, 17–22.
  6. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Wheeler, D. L. (2003) GenBank. Nucleic Acids Res. 31, 23–27.
  7. Boeckmann, B., Bairoch, A., Apweiler, R., et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370.
  8. Wu, C. H., Yeh, L. S., Huang, H., et al. (2003) The Protein Information Resource. Nucleic Acids Res. 31, 345–347.
  9. Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2003) NCBI Reference Sequence Project: update and current status. Nucleic Acids Res. 31, 34–37.
  10. Westbrook, J., Feng, Z., Chen, L., Yang, H., and Berman, H. M. (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res. 31, 489–491.
  11. Dayhoff, M. O. (1978) Atlas of Protein Sequence and Structure, Vol. 5, Supplement 3. National Biomedical Research Foundation, Washington, DC.
  12. Gasteiger, E., Jung, E., and Bairoch, A. (2001) SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr. Issues Mol. Biol. 3, 47–55.
  13. Wain, H. M., Lush, M., Ducluzeau, F., and Povey, S. (2002) Genew: the human gene nomenclature database. Nucleic Acids Res. 30, 169–171.
  14. FlyBase consortium. (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 31, 172–175.
  15. Blake, J. A., Richardson, J. E., Bult, C. J., Kadin, J. A., and Eppig, J. T. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res. 31, 193–195.
  16. Junker, V., Apweiler, R., and Bairoch, A. (1999) Representation of functional information in the Swiss-Prot data bank. Bioinformatics 15, 1066–1067.
  17. O’Donovan, C., Martin, M. J., Glemet, E., Codani, J., and Apweiler, R. (1999) Removing redundancy in Swiss-Prot and TrEMBL. Bioinformatics 15, 258–259.
  18. Apweiler, R. (2001) Functional information in SWISS-PROT: the basis for large-scale characterisation of protein sequences. Briefings in Bioinformatics 2, 9–18.
  19. Fleischmann, W., Moeller, S., Gateau, A., and Apweiler, R. (1998) A novel method for automatic and reliable functional annotation. Bioinformatics 15, 228–233.
  20. Mulder, N. J., Apweiler, R., Attwood, T. K., et al. (2003) The InterPro database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31, 315–318.
  21. Falquet, L., Pagni, M., Bucher, P., et al. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res. 30, 235–238.
  22. Attwood, T. K., Bradley, P., Flower, D. R., et al. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 31, 400–402.
  23. Bateman, A., Birney, E., Cerruti, L., et al. (2002) The Pfam protein families database. Nucleic Acids Res. 30, 276–280.
  24. Corpet, F., Servant, F., Gouzy, J., and Kahn, D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28, 267–269.
  25. Letunic, I., Goodstadt, L., Dickens, N. J., et al. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res. 30, 242–244.
  26. Haft, D. H., Selengut, J. D., and White, O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373.
  27. Huang, H., Barker, W. C., Chen, Y., and Wu, C. H. (2003) iProClass: an integrated database of protein family, function and structure information. Nucleic Acids Res. 31, 390–392.
  28. Gough, J., Karplus, K., Hughey, R., and Chothia, C. (2001) Assignment of homology to genome sequences using a library of Hidden Markov Models that represent all proteins of known structure. J. Mol. Biol. 313, 903–919.
  29. Rawlings, N. D., O’Brien, E., and Barrett, A. J. (2002) MEROPS: the protease database. Nucleic Acids Res. 30, 343–346.
  30. Butler, D. (2002) NIH pledges cash for global protein database. Nature 419, 101.
  31. Clamp, M., Andrews, D., Barker, D., et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res. 31, 38–42.
  32. Harris, T. W., Lee, R., Schwarz, E., et al. (2003) WormBase: a cross-species database for comparative genomics. Nucleic Acids Res. 31, 133–137.

Copyright information

© Humana Press Inc., Totowa, NJ 2005

Authors and Affiliations

  • Michele Magrane (1)
  • Maria Jesus Martin (1)
  • Claire O’Donovan (1)
  • Rolf Apweiler (1)

  1. European Bioinformatics Institute, Cambridge
