Protein Identification as a Suitable Application for Fast Data Architecture

  • Roman Zoun (email author)
  • Gabriel Campero Durand
  • Kay Schallert
  • Apoorva Patrikar
  • David Broneske
  • Wolfram Fenske
  • Robert Heyer
  • Dirk Benndorf
  • Gunter Saake
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 903)


Metaproteomics is a field of biological research that relies on mass spectrometry to characterize the protein complement of microbial communities. Since only identified data can be analyzed, identification algorithms such as X!Tandem, OMSSA and Mascot are essential in the domain for gaining insight into biological experimental data. However, existing protein identification software was developed for proteomics. Metaproteomics, in contrast, involves large biological communities, gigabytes of experimental data per sample, and a greater number of comparisons, given the mixed culture of species in the protein database. Furthermore, the file-based nature of current protein identification tools makes them ill-suited for future metaproteomics research, and possible medical use cases of metaproteomics require near real-time identification. From a technology perspective, Fast Data seems promising for increasing the throughput and performance of protein identification in a metaproteomics workflow. In this paper we analyze the core functions of the established protein identification engine X!Tandem and show that streaming Fast Data architectures are suitable for protein identification. We also point out the bottlenecks of the current algorithms and describe how our approach removes them.


Keywords: Fast Data, Bioinformatics, Metaproteomics, Proteomics, Cloud computing, Protein identification



The authors sincerely thank Xiao Chen, Sebastian Krieter, Andreas Meister and Marcus Pinnecke for their support and advice. This work is partly funded by the de.NBI Network (031L0103), the DFG (grant no. SA 465/50-1), the European Regional Development Fund (grant no. 11.000sz00.00.017 114347 0) and the German Federal Ministry of Food and Agriculture (grant no. 22404015). It is dedicated to the memory of Mikhail Zoun.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Roman Zoun (1), email author
  • Gabriel Campero Durand (1)
  • Kay Schallert (2)
  • Apoorva Patrikar (3)
  • David Broneske (1)
  • Wolfram Fenske (1)
  • Robert Heyer (2)
  • Dirk Benndorf (2)
  • Gunter Saake (1)

  1. Working Group Databases and Software Engineering, University of Magdeburg, Magdeburg, Germany
  2. Chair of Bioprocess Engineering, University of Magdeburg, Magdeburg, Germany
  3. Accenture GmbH, Kronberg im Taunus, Germany
