Proteomics Databases and Repositories

  • Lennart Martens
Part of the Methods in Molecular Biology book series (MIMB, volume 694)


With the advent of more powerful and sensitive analytical techniques and instruments, the field of mass spectrometry-based proteomics has seen a considerable increase in the amount of data generated. Correspondingly, the need to make these data publicly available in centralized online databases has become more pressing. As a result, several such databases have been created, and steps are being taken to integrate these systems under a single worldwide data-sharing umbrella. This chapter discusses the importance of such databases and the infrastructure they require for efficient operation. It then describes the kinds of information that proteomics databases can store and the types of databases available today. Finally, a selection of prominent repositories is described in more detail, together with the international ProteomExchange consortium, which aims to unite the different databases in a global data-sharing collaboration.

Key words

Proteomics, Mass spectrometry, Identifications, Database, Repository, ProteomExchange



The author would like to thank Henning Hermjakob and Rolf Apweiler for their support.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. EMBL Outstation, European Bioinformatics Institute (EBI), Cambridge, UK
