Review of Issues and Solutions to Data Analysis Reproducibility and Data Quality in Clinical Proteomics

  • Mathias Walzer
  • Juan Antonio Vizcaíno
Part of the Methods in Molecular Biology book series (MIMB, volume 2051)


In any analytical discipline, data analysis reproducibility is closely interlinked with data quality. In this chapter, which focuses on mass spectrometry-based proteomics approaches, we describe how data analysis reproducibility and data quality influence each other, and how data quality and data analysis designs can be used to increase robustness and improve reproducibility. We first introduce methods and concepts for designing and maintaining robust data analysis pipelines, so that reproducibility increases in parallel. The technical aspects of data analysis reproducibility are challenging, and current ways to increase overall robustness are multifaceted; software containerization and cloud infrastructures play an important part.

We also show how quality control (QC) and quality assessment (QA) approaches can be used to spot analytical issues, reduce experimental variability, and increase confidence in the analytical results of (clinical) proteomics studies, since experimental variability plays a substantial role in analysis reproducibility. We therefore give an overview of existing solutions for QC/QA, including different quality metrics and methods for longitudinal monitoring. The efficient use of both types of approaches provides a way to improve the experimental reliability, reproducibility, and consistency of proteomics analytical measurements.

Key words

Computational mass spectrometry; Quality control approaches; Large scale data analysis; Cloud technology; Reproducible analysis pipelines



Abbreviations

  • Bovine serum albumin
  • Common workflow language
  • Data Access Compliance
  • Data Access Compliance Office
  • Data-dependent acquisition
  • Data-independent acquisition
  • False discovery rate
  • Graphical user interface
  • High-performance computing
  • Human Proteome Organization
  • Liquid chromatography
  • Lower control level
  • Mass spectrometry
  • Pan-cancer analysis of whole genomes
  • Proteomics Standards Initiative
  • Quality assessment
  • Quality control
  • Standard operating procedure
  • Statistical process control
  • Selected reaction monitoring
  • Upper control level
  • Workflow management system



Acknowledgments

The authors wish to acknowledge funding from ELIXIR Implementation Studies, BBSRC [grant number BB/P024599/1], Wellcome Trust [grant number 208391/Z/17/Z], and EMBL core funding.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Authors and Affiliations

  1. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
