Encyclopedia of Systems Biology

2013 Edition
Editors: Werner Dubitzky, Olaf Wolkenhauer, Kwang-Hyun Cho, Hiroki Yokota

Proteome Analysis Pipeline

  • Lars Malmström
  • Andreas Quandt
  • Ela Pustulka-Hunt
Reference work entry
DOI: https://doi.org/10.1007/978-1-4419-9863-7_1002

Definition

A proteome analysis pipeline analyzes the data acquired during a proteomics experiment and produces results that are interpretable and accessible to the researcher and that can be used for publication or deposited in a proteomics resource.

Characteristics

A proteome data analysis pipeline interprets proteomics data and makes the results available for further scrutiny. The interpretation uses several computer applications: some operate on a single experiment and can be executed in parallel, whereas others rely on data from several experiments and are applied only once per analysis. The analysis often involves very large amounts of data (GB range), uses nontrivial amounts of computing power (on a compute cluster), and may be organized as a workflow. The workflow depends on the question the experiment is designed to answer, is subject to frequent modification, and may involve parameter sweeps. It is advantageous to specify the computational protocol as a workflow in which each workflow is designed for a single purpose; parts of a workflow can then be reused in other workflows.
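The split between per-experiment steps and once-per-analysis steps can be illustrated with a minimal Python sketch. The function names (`analyze_experiment`, `combine`, `run_workflow`) and the toy intensity data are hypothetical and stand in for real applications; a production pipeline would dispatch the per-experiment step to a compute cluster rather than to local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_experiment(name, intensities):
    """Per-experiment step: depends on one experiment only, so it may run in parallel."""
    return {"experiment": name, "total": sum(intensities)}


def combine(per_experiment):
    """Once-per-analysis step: needs the results of every experiment."""
    grand_total = sum(r["total"] for r in per_experiment)
    return {r["experiment"]: r["total"] / grand_total for r in per_experiment}


def run_workflow(experiments):
    """Run the parallel stage for each experiment, then the single combining stage."""
    names = sorted(experiments)
    with ThreadPoolExecutor() as pool:
        per_experiment = list(
            pool.map(analyze_experiment, names, (experiments[n] for n in names))
        )
    return combine(per_experiment)


# Hypothetical toy input: two experiments with made-up signal intensities.
experiments = {"run_a": [1.0, 3.0], "run_b": [4.0]}
shares = run_workflow(experiments)
```

Keeping each stage as a separate function is one way to let parts of a workflow be reused in other workflows.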

Example

Protein identification is one of the standard data analysis procedures used in proteomics. The Trans-Proteomic Pipeline (Keller et al. 2005; Deutsch et al. 2010; TPP 2011) uses protein identification as one of its processing steps (Figs. 1 and 2). In protein identification, a search engine such as X!Tandem (Craig and Beavis 2004) first infers peptides from the spectra in each data file. This yields a proposed peptide assignment for each spectrum, together with a score that evaluates how well the inferred peptide explains the data in the spectrum. The search is done in parallel for each input file. The scores are then evaluated statistically in a manner specific to the search engine used for inference. At this stage, scores for all the spectra from various search engines are considered and unified across all the samples. The next step is to assemble the peptide list into the most likely list of proteins and to present these proteins with the evidence supporting the most likely assignment. The result can be submitted to a protein resource, such as PRIDE (Vizcaíno et al. 2009), and cited in publications.
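The final assembly step can be illustrated with a toy sketch. It assumes a simple score threshold followed by a greedy set cover, which is a swapped-in simplification and not the statistical model used by the TPP or by any of the search engines named in this entry. All names, scores, and peptide-to-protein mappings below are hypothetical.

```python
def filter_peptides(psms, min_score):
    """Keep the best-scoring peptide per spectrum, if it reaches the threshold."""
    best = {}
    for spectrum, peptide, score in psms:
        if score >= min_score and (spectrum not in best or score > best[spectrum][1]):
            best[spectrum] = (peptide, score)
    return {peptide for peptide, _ in best.values()}


def infer_proteins(peptides, peptide_to_proteins):
    """Greedy set cover: repeatedly pick the protein explaining most unexplained peptides."""
    unexplained = {p for p in peptides if peptide_to_proteins.get(p)}
    evidence = {}
    while unexplained:
        support = {}
        for pep in unexplained:
            for prot in peptide_to_proteins[pep]:
                support.setdefault(prot, set()).add(pep)
        # Largest support wins; ties are broken by protein name for determinism.
        chosen = min(support, key=lambda prot: (-len(support[prot]), prot))
        evidence[chosen] = sorted(support[chosen])
        unexplained -= support[chosen]
    return evidence


# Hypothetical peptide-spectrum matches: (spectrum id, peptide, score).
psms = [("s1", "PEPA", 0.9), ("s2", "PEPB", 0.8), ("s3", "PEPC", 0.95), ("s4", "PEPD", 0.2)]
# Hypothetical mapping from peptides to the proteins that contain them.
mapping = {"PEPA": ["P1"], "PEPB": ["P1", "P2"], "PEPC": ["P3"], "PEPD": ["P4"]}

proteins = infer_proteins(filter_peptides(psms, min_score=0.5), mapping)
```

The returned dictionary maps each reported protein to the peptides that support it, mirroring the idea of presenting proteins together with their evidence.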
Proteome Analysis Pipeline, Fig. 1

The TPP identification pipeline accepts a list of parameters used in parameter sweeps and a list of mzXML (mzXML 2011) files. The input files are searched against a database of known protein sequences, with the parameter sets as specified. The results are unified and merged.
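How a parameter sweep and a list of input files expand into individual search jobs can be sketched as follows. The parameter names (`precursor_tolerance_ppm`, `missed_cleavages`), the file names, and the helper functions are hypothetical and do not correspond to actual TPP or search-engine options.

```python
from itertools import product


def expand_sweep(grid):
    """Turn {parameter: [values]} into one dict per parameter combination."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def plan_searches(grid, mzxml_files):
    """Pair every parameter set with every input file; each pair is one search job."""
    return [(path, params) for params in expand_sweep(grid) for path in mzxml_files]


# Hypothetical sweep over two search parameters and two input files.
grid = {"precursor_tolerance_ppm": [10, 20], "missed_cleavages": [1, 2]}
files = ["sample1.mzXML", "sample2.mzXML"]
jobs = plan_searches(grid, files)
```

With two values for each of two parameters and two files, the plan contains eight independent jobs, which is why sweeps quickly call for cluster resources.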

Proteome Analysis Pipeline, Fig. 2

An example architecture for proteomics data analysis (Kunszt et al. 2011). Mass spectrometers produce data files, which are then moved automatically to a Data Storage Server (DSS-Server). The openBIS database (Bauch et al. 2011) picks the data up from the DSS via a dropbox. P-Grade workflows pick up the data from the DSS when a user submits an analysis, and the request for processing is sent to the computational infrastructure (imsb-ra-mascot and Brutus). The analysis results are sent via the DSS to openBIS. Hierarchical Storage Management (HSM) supports data archival and communicates with other subsystems via the DSS.

The most commonly used packages in protein identification are MASCOT (Perkins et al. 1999), SEQUEST (Eng et al. 1994), OMSSA (Geer et al. 2004), X!Tandem, SpectrumMill (Agilent 2011), OLAV (Colinge et al. 2003), and TPP.

References

  1. Agilent (2011) www.agilent.com. Accessed 25 May 2011
  2. Bauch A, Adamczyk I, Buczek P, Elmer F-J, Enimanev K, Glyzewski P, Kohler M, Pylak T, Quandt A, Ramakrishnan C, Beisel C, Malmström L, Aebersold R, Rinn B (2011) openBIS: a flexible framework for managing and analyzing complex data in biology research. BMC Bioinformatics 12:468
  3. Colinge J, Masselot A, Giron M, Dessingy T, Magnin J (2003) OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics 3(8):1454–1463
  4. Craig R, Beavis RC (2004) TANDEM: matching proteins with mass spectra. Bioinformatics 20(9):1466–1467
  5. Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii AI, Aebersold R (2010) A guided tour of the Trans-Proteomic Pipeline. Proteomics 10(6):1150–1159
  6. Eng JK, McCormack AJ, Yates JR III (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5:976–989
  7. Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH (2004) Open mass spectrometry search algorithm. J Proteome Res 3(5):958–964
  8. Keller A, Eng J, Zhang N, Li XJ, Aebersold R (2005) A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 1:2005.0017
  9. Kunszt P, Espona Pernas L, Quandt A, Schmid E, Hunt E, Malmstrom L (2011) The Swiss Grid Proteomics Portal. PARENG’11: The Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering. Civil-Comp Press, Stirlingshire, UK, Paper 81
  10. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551–3567
  11. TPP (2011) Trans-Proteomic Pipeline. http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP. Accessed 25 May 2011
  12. Vizcaíno JA, Côté R, Reisinger F, Foster JM, Mueller M, Rameseder J, Hermjakob H, Martens L (2009) A guide to the Proteomics Identifications Database proteomics data repository. Proteomics 9(18):4276–4283. http://www.ebi.ac.uk/pride

Copyright information

© Springer Science+Business Media, LLC 2013

Authors and Affiliations

  • Lars Malmström (1)
  • Andreas Quandt (1)
  • Ela Pustulka-Hunt (2)
  1. Institute of Molecular Systems Biology IMSB, ETH Zurich, Zurich, Switzerland
  2. GS SystemsX, ETH Zurich, Zurich, Switzerland