Abstract
Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing a genome takes little time and money, yet it yields terabytes of data that must be stored and analyzed. Biologists often face excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.
Notes
For this use case, we do not benefit from the read-oriented storage that MonetDB/BAM uses. However, [2] describes many use cases that do benefit from it.
References
[1] Beerenwinkel N et al (2012) Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front Microbiol 3:329
[2] Cijvat R (2014) Bridging the gap between big genome data analysis and database management systems. Master’s thesis, CWI and Utrecht University
[3] Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Paper presented at the 6th Symposium on Operating System Design and Implementation, San Francisco, December 2004
[4] Dorok S et al (2014) Toward efficient variant calling inside main-memory database systems. BIOKDD-DEXA Workshops, pp. 41–45
[5] Gire SK et al (2014) Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 345(6202):1369–1372
[6] Kargin Y, Kersten ML, Manegold S, Pirk H (2015) The DBMS - your big data sommelier. Proceedings of IEEE International Conference on Data Engineering 2015 (ICDE 31)
[7] Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760
[8] Li H et al (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079
[9] Manegold S et al (2009) Database architecture evolution: mammals flourished long before dinosaurs became extinct. PVLDB 2(2):1648–1653
[10] Pavlo A et al (2009) A comparison of approaches to large-scale data analysis. SIGMOD, pp. 165–178
[11] Quinlan A, Hall I (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842
[12] Röhm U, Blakeley JA (2009) Data management for high-throughput genomics. CIDR, pp. 97–111
[13] Schapranow MP, Plattner H (2013) HIG - An in-memory database platform enabling real-time analyses of genome data. BigData, pp. 691–696
[14] Schatz MC, Langmead B (2013) The DNA data deluge. IEEE Spectrum 50(7):28–33
[15] Toepfer A et al (2014) Viral quasispecies assembly via maximal clique enumeration. PLoS Comput Biol 10(3):e1003515
[16] Volchkov VE et al (1999) Characterization of the L gene and 5′ trailer region of Ebola virus. J Gen Virol 80(Pt 2):355–362
[17] Wolstencroft K et al (2013) The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res 41(Web Server issue):W557–W561
Cite this article
Cijvat, R., Manegold, S., Kersten, M. et al. Genome sequence analysis with MonetDB. Datenbank Spektrum 15, 185–191 (2015). https://doi.org/10.1007/s13222-015-0198-x