Volume 17, Issue 2, pp 139–154

Efficiently Storing and Analyzing Genome Data in Database Systems

  • Sebastian Dorok
  • Sebastian Breß
  • Jens Teubner
  • Horstfried Läpple
  • Gunter Saake
  • Volker Markl


Genome analysis enables researchers to detect mutations within genomes and deduce their consequences. Researchers need reliable analysis platforms to ensure reproducible and comprehensive analysis results. Database systems provide vital support for implementing the required sustainable procedures. Nevertheless, they are not used throughout the complete genome-analysis process, because (1) they suffer from high storage overhead for genome data and (2) they introduce overhead during domain-specific analysis. To overcome these limitations, we integrate genome-specific compression into database systems using a specialized database schema, which reduces the storage consumption of a database approach by up to 35%. Moreover, we exploit genome-data characteristics during query processing, allowing us to analyze real-world data sets up to five times faster than specialized analysis tools and up to eight times faster than a straightforward database approach.


Main-memory database systems · Genome analysis · Variant calling



This work has received funding from the German Research Foundation (DFG), Collaborative Research Center SFB 876, project C5; from the European Union’s Horizon 2020 Research & Innovation Program under grant agreement 671500 (project SAGE); and from the German Ministry for Education and Research as Berlin Big Data Center BBDC (01IS14013A).



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Sebastian Dorok (1, email author)
  • Sebastian Breß (2, 3)
  • Jens Teubner (4)
  • Horstfried Läpple (5)
  • Gunter Saake (1)
  • Volker Markl (2, 3)

  1. University Magdeburg, Magdeburg, Germany
  2. DFKI GmbH, Berlin, Germany
  3. TU Berlin, Berlin, Germany
  4. TU Dortmund, Dortmund, Germany
  5. Bayer HealthCare AG, Berlin, Germany
