Efficient Transformation of Protein Sequence Databases to Columnar Index Schema

  • Conference paper
Database and Expert Systems Applications (DEXA 2019)

Abstract

Mass spectrometry is used to sequence proteins and extract biomarkers of biological environments. These biomarkers can be used to diagnose thousands of diseases and to optimize biological environments such as biogas plants. Indexing the protein sequence data makes it possible to streamline the experiments and speed up the analysis. In our work, we present a schema for distributed column-based database management systems that uses a column-oriented index to store sequence data. This raises the problem of how to transform the protein sequence data from the standard format into the new schema. We analyze and evaluate four transformation methods. The results show that our proposed extended radix tree performs best with regard to memory consumption and calculation time. Hence, the radix tree proves to be a suitable data structure for transforming protein sequences into the indexed schema.

Supported by de.NBI and Bruker Daltonik GmbH.
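
The radix tree named in the abstract can be illustrated with a minimal sketch. This is a plain radix tree (compressed trie), not the authors' extended variant, which this page does not describe. The names `Node`, `insert`, `lookup`, the example sequences and the identifiers `P1` to `P3` are hypothetical and chosen only for illustration. The sketch assumes each stored sequence is associated with one or more protein identifiers, so that sequences sharing a prefix also share the corresponding part of the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Outgoing edges keyed by first character; value is (edge label, child).
    edges: dict = field(default_factory=dict)
    # Hypothetical payload: identifiers of proteins whose sequence ends here.
    protein_ids: set = field(default_factory=set)


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def insert(root: Node, sequence: str, protein_id: str) -> None:
    """Insert a sequence, splitting an edge where it diverges."""
    node, rest = root, sequence
    while rest:
        entry = node.edges.get(rest[0])
        if entry is None:
            # No edge shares a prefix: attach the remainder as one edge.
            leaf = Node()
            node.edges[rest[0]] = (rest, leaf)
            node, rest = leaf, ""
            break
        label, child = entry
        k = common_prefix_len(label, rest)
        if k < len(label):
            # Split the edge at the point of divergence.
            mid = Node()
            mid.edges[label[k]] = (label[k:], child)
            node.edges[rest[0]] = (label[:k], mid)
            child = mid
        node, rest = child, rest[k:]
    node.protein_ids.add(protein_id)


def lookup(root: Node, sequence: str) -> set:
    """Return the identifiers stored for an exact sequence match."""
    node, rest = root, sequence
    while rest:
        entry = node.edges.get(rest[0])
        if entry is None:
            return set()
        label, child = entry
        if not rest.startswith(label):
            return set()
        node, rest = child, rest[len(label):]
    return set(node.protein_ids)


root = Node()
insert(root, "PEPTIDEK", "P1")
insert(root, "PEPTIDER", "P2")
insert(root, "PEPK", "P3")
print(lookup(root, "PEPTIDEK"))
```

Because the three example sequences share the prefix `PEP`, the root keeps a single outgoing edge and the shared characters are stored once. That kind of prefix sharing is one plausible reason a radix tree can save memory on sequence data, although the abstract itself does not give the reason for the reported results.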

Acknowledgments

The authors sincerely thank Niya Zoun, Gabriel Campero Durand, Marcus Pinnecke, Sebastian Krieter, Sven Helmer, Sven Brehmer and Andreas Meister for their support and advice. This work is partly funded by the BMBF (Fkz: 031L0103), the European Regional Development Fund (no.: 11.000sz00.00.0 17 114347 0), the DFG (grant no.: SA 465/50-1) and the German Federal Ministry of Food and Agriculture (grant no.: 22404015). It is dedicated to the memory of Mikhail Zoun.

Author information

Corresponding author

Correspondence to Roman Zoun.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Zoun, R. et al. (2019). Efficient Transformation of Protein Sequence Databases to Columnar Index Schema. In: Anderst-Kotsis, G., et al. Database and Expert Systems Applications. DEXA 2019. Communications in Computer and Information Science, vol 1062. Springer, Cham. https://doi.org/10.1007/978-3-030-27684-3_10

  • DOI: https://doi.org/10.1007/978-3-030-27684-3_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27683-6

  • Online ISBN: 978-3-030-27684-3

  • eBook Packages: Computer Science, Computer Science (R0)
