Efficiently Storing and Analyzing Genome Data in Database Systems

Dorok, Sebastian; Breß, Sebastian; Teubner, Jens; Läpple, Horstfried; Saake, Gunter; Markl, Volker

doi:10.1007/s13222-017-0254-9

Technical contribution (Fachbeitrag)
Published: 16 June 2017

Datenbank-Spektrum, Volume 17, pages 139–154 (2017)

Sebastian Dorok¹,
Sebastian Breß²,³,
Jens Teubner⁴,
Horstfried Läpple⁵,
Gunter Saake¹ &
Volker Markl²,³

Abstract

Genome analysis enables researchers to detect mutations within genomes and deduce their consequences. Researchers need reliable analysis platforms to ensure reproducible and comprehensive analysis results. Database systems provide vital support for implementing the required sustainable procedures. Nevertheless, they are not used throughout the complete genome-analysis process, because (1) database systems suffer from high storage overhead for genome data and (2) they introduce overhead during domain-specific analysis. To overcome these limitations, we integrate genome-specific compression into database systems using a specialized database schema. This reduces the storage consumption of a database approach by up to 35%. Moreover, we exploit genome-data characteristics during query processing, allowing us to analyze real-world data sets up to five times faster than specialized analysis tools and eight times faster than a straightforward database approach.
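
The abstract does not spell out the compression scheme. As a rough illustration of what genome-specific compression can look like, the following Python sketch uses reference-based compression: for each aligned read it stores only the bases that differ from a reference sequence. This is a toy under stated assumptions, not the authors' implementation. The function names, the plain-string reference, and the substitution-only model (no inserted or deleted bases) are all hypothetical.

```python
from typing import List, Tuple

# (offset within the read, base that differs from the reference)
Mismatch = Tuple[int, str]


def compress_read(reference: str, start: int, read: str) -> List[Mismatch]:
    """Keep only the bases of `read` that differ from `reference` at `start`.

    Assumption: the read aligns without insertions or deletions.
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]


def decompress_read(
    reference: str, start: int, length: int, mismatches: List[Mismatch]
) -> str:
    """Rebuild the read by copying the reference window and patching mismatches."""
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Hypothetical data: a 10-base reference and a 6-base read aligned at position 2.
reference = "ACGTACGTAC"
stored = compress_read(reference, 2, "GTTCGA")      # [(2, 'T'), (5, 'A')]
restored = decompress_read(reference, 2, 6, stored)  # 'GTTCGA'
```

Reads that match the reference closely shrink to a start position, a length, and a short mismatch list, which is where a scheme of this kind gets its storage savings.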

Notes

For simplicity, we only consider mismatching bases and omit inserted or deleted bases.
Using the base-centric database schema, we already apply CIGAR operations to the base values of reads.
We have to subtract a possible offset if the index of interest is encoded within the fill word.
The data is available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/

References

1. Abadi D, Madden S, Ferreira M (2006) Integrating compression and execution in column-oriented database systems. SIGMOD, pp 671–682
2. Abadi D, Madden S, Hachem N (2008) Column-stores vs. row-stores: How different are they really? SIGMOD, pp 967–980
3. Bhagwat D, Chiticariu L, Tan W-C, Vijayvargiya G (2004) An annotation management system for relational databases. VLDB, pp 900–911
4. Bloniarz A, Talwalkar A, Terhorst J et al (2014) Changepoint analysis for efficient variant calling. RECOMB, pp 20–34
5. Breß S (2014) The design and implementation of CoGaDB: a column-oriented GPU-accelerated DBMS. Datenbank Spektrum 14(3):199–209
6. Breß S, Funke H, Teubner J (2016) Robust query processing in co-processor-accelerated databases. SIGMOD, pp 1891–1906
7. Bromberg Y (2013) Building a genome analysis pipeline to predict disease risk and prevent disease. J Mol Biol 425(21):3993–4005
8. Ceri S, Kaitoua A, Masseroli M, Pinoli P, Venco F (2016) Data management for next generation genomic computing. EDBT, pp 485–490
9. Cijvat R, Manegold S, Kersten M et al (2015) Genome sequence analysis with MonetDB. Datenbank Spektrum 15(3):185–191
10. Working Group (2015) CRAM Format Specification. https://samtools.github.io/hts-specs/CRAMv3.pdf
11. DePristo M, Banks E, Poplin R et al (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5):491–498
12. Dorok S (2016) Memory efficient processing of DNA sequences in relational main-memory database systems. GvDB, pp 39–43
13. Dorok S (2017) Efficient storage and analysis of genome data in relational database systems. PhD thesis. School of Computer Science
14. Dorok S, Breß S, Saake G (2014) Toward efficient variant calling inside main-memory database systems. BIOKDD-DEXA, pp 41–45
15. Dorok S, Breß S, Teubner J et al (2017) Efficient storage and analysis of genome data in databases. BTW, pp 423–442
16. Eltabakh MY, Ouzzani M, Aref WG (2007) bdbms - A database management system for biological data. CIDR, pp 196–206
17. Fähnrich C, Schapranow M, Plattner H (2015) Facing the genome data deluge: efficiently identifying genetic variants with in-memory database technology. SAC, pp 18–25
18. Hsi-Yang MF, Leinonen R, Cochrane G, Birney E (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 21:734–740
19. Kuenne C, Grosse I, Matthies I et al (2007) Using data warehouse technology in crop plant bioinformatics. J Integr Bioinform 4(1). doi:10.2390/biecoll-jib-2007-88
20. Lee TJ, Pouliot Y, Wagner V et al (2006) BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics 7(1):170
21. Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinformatics 11(5):473–483
22. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and samtools. Bioinformatics 25(16):2078–2079
23. Liu L, Li Y, Li S et al (2012) Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012:1–11
24. Mardis ER (2010) The $1,000 genome, the $100,000 analysis? Genome Med 2(11):1–3
25. Mavaddat N, Peock S, Frost D et al (2013) Cancer risks for BRCA1 and BRCA2 mutation carriers: results from prospective analysis of EMBRACE. J Natl Cancer Inst 105(11):812–822. doi:10.1093/jnci/djt095
26. Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 12(6):443–451
27. Quail M, Smith M, Coupland P et al (2012) A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13(1):341
28. Röhm U, Blakeley JA (2009) Data management for high-throughput genomics. CIDR
29. SAM/BAM Format Specification Working Group (2015) Sequence alignment/map format specification. https://samtools.github.io/hts-specs/SAMv1.pdf
30. Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten simple rules for reproducible computational research. PLoS Comput Biol 9(10). doi:10.1371/journal.pcbi.1003285
31. Shah SP, Huang Y, Xu T et al (2005) Atlas – a data warehouse for integrative bioinformatics. BMC Bioinformatics 6:34
32. Stein LD, Thierry-Mieg J (1999) AceDB: A genome database management system. Comput Sci Eng 1(3):44–52
33. The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526(7571):68–74
34. Töpel T, Kormeier B, Klassen A, Hofestädt R (2008) BioDWH: a data warehouse kit for life science data integration. J Integr Bioinform 5(2). doi:10.2390/biecoll-jib-2008-93
35. Wandelt S, Starlinger J, Bux M, Leser U (2013) RCSI: scalable similarity search in thousand(s) of genomes. Proc VLDB Endow 6(13):1534–1545
36. Wu K, Otoo E, Shoshani A (2006) Optimizing bitmap indices with efficient compression. ACM Trans Database Syst 31(1):1–38

Acknowledgements

This work has received funding from the German Research Foundation (DFG), Collaborative Research Center SFB 876, project C5; from the European Union's Horizon 2020 Research & Innovation Program under grant agreement 671500 (project SAGE); and from the German Ministry for Education and Research as Berlin Big Data Center BBDC (01IS14013A).

Author information

Authors and Affiliations

1. University Magdeburg, Magdeburg, Germany
Sebastian Dorok & Gunter Saake
2. DFKI GmbH, Berlin, Germany
Sebastian Breß & Volker Markl
3. TU Berlin, Berlin, Germany
Sebastian Breß & Volker Markl
4. TU Dortmund, Dortmund, Germany
Jens Teubner
5. Bayer HealthCare AG, Berlin, Germany
Horstfried Läpple

Corresponding author

Correspondence to Sebastian Dorok.

Additional information

This is an extended version of our earlier work [15].

Work by S. Dorok was done in part while employed at Bayer Business Services GmbH and Bayer Pharma AG.

Work by S. Breß was done in part while employed at TU Dortmund.

About this article

Cite this article

Dorok, S., Breß, S., Teubner, J. et al. Efficiently Storing and Analyzing Genome Data in Database Systems. Datenbank Spektrum 17, 139–154 (2017). https://doi.org/10.1007/s13222-017-0254-9

Received: 19 April 2017
Accepted: 22 May 2017
Published: 16 June 2017
Issue Date: July 2017
DOI: https://doi.org/10.1007/s13222-017-0254-9
