Protein Sequence Databases

Protocol, in: Data Mining Techniques for the Life Sciences

Part of the book series: Methods in Molecular Biology (MIMB, volume 609)

Abstract

Protein sequence databases contain not only the sequence of a protein but also annotation that reflects our knowledge of its function and of the residues that contribute to it. In this chapter, we discuss various public protein sequence databases, with a focus on those that are generally applicable. Special attention is paid to the reliability of both sequence and annotation, as these are fundamental to many questions researchers will ask. Using both well-annotated and scarcely annotated human proteins as examples, we show what information about the targets can be collected from freely available Internet resources and how this information can be used. The results are summarized in a simple graphical model of the protein’s sequence architecture that highlights its structural and functional modules.


Copyright information

© 2010 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Rebhan, M. (2010). Protein Sequence Databases. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 609. Humana Press. https://doi.org/10.1007/978-1-60327-241-4_3

  • DOI: https://doi.org/10.1007/978-1-60327-241-4_3

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-60327-240-7

  • Online ISBN: 978-1-60327-241-4

  • eBook Packages: Springer Protocols
