Volume 15, Issue 3, pp 185–191

Genome sequence analysis with MonetDB

A case study on Ebola virus diversity
  • Robin Cijvat
  • Stefan Manegold
  • Martin Kersten
  • Gunnar W. Klau
  • Alexander Schönhuth
  • Tobias Marschall
  • Ying Zhang


Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing genomes takes little time and money, but yields terabytes of data that must be stored and analyzed. Biologists often face excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose an approach based on a database management system (DBMS) to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module that enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus






Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Robin Cijvat (1; email author)
  • Stefan Manegold (2)
  • Martin Kersten (1, 2)
  • Gunnar W. Klau (2)
  • Alexander Schönhuth (2)
  • Tobias Marschall (3)
  • Ying Zhang (1, 2)

  1. MonetDB Solutions, Amsterdam, The Netherlands
  2. Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
  3. Saarland University & Max Planck Institute for Informatics, Saarbrücken, Germany
