Genome sequence analysis with MonetDB

A case study on Ebola virus diversity

  • Focus article (Schwerpunktbeitrag)
  • Datenbank-Spektrum

Abstract

Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing a genome takes little time and money, but yields terabytes of data to be stored and analyzed. Biologists often face data management and analysis hurdles that are excessively time-consuming and error-prone. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.
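To make the idea of DBMS-based alignment analysis concrete, the following is a minimal sketch, not the MonetDB/BAM interface. It assumes Python's built-in sqlite3 module as a stand-in DBMS, three made-up SAM records, and hypothetical names (`alignments`, `load_alignments`, `mapped_high_quality`). It only illustrates how alignment records, once stored in a relational table, can be filtered with a declarative SQL query.

```python
import sqlite3

# Three made-up SAM records with the 11 mandatory tab-separated fields:
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
SAM_LINES = [
    "read1\t0\tref\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t16\tref\t150\t12\t8M\t*\t0\t0\tTTGCAAGC\tIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII",
]


def load_alignments(conn, lines):
    """Create a hypothetical 'alignments' table and load SAM records into it."""
    conn.execute(
        "CREATE TABLE alignments ("
        "qname TEXT, flag INTEGER, rname TEXT, pos INTEGER, mapq INTEGER, "
        "cigar TEXT, rnext TEXT, pnext INTEGER, tlen INTEGER, "
        "seq TEXT, qual TEXT)"
    )
    rows = []
    for line in lines:
        f = line.split("\t")[:11]  # ignore optional fields, if any
        rows.append((f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10]))
    conn.executemany(
        "INSERT INTO alignments VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows
    )


def mapped_high_quality(conn, min_mapq):
    """Return (qname, pos) of mapped reads with mapping quality >= min_mapq.

    In the SAM format, FLAG bit 0x4 marks a read as unmapped.
    """
    cur = conn.execute(
        "SELECT qname, pos FROM alignments "
        "WHERE (flag & 4) = 0 AND mapq >= ? ORDER BY pos",
        (min_mapq,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_alignments(conn, SAM_LINES)
print(mapped_high_quality(conn, 30))  # [('read1', 101)]
```

The filter that would otherwise be a custom script over a SAM file becomes a single WHERE clause; changing the analysis means changing the query rather than the parsing code.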

Notes

  1. https://www.monetdb.org/

  2. https://www.monetdb.org/bam/

  3. For this use case, we do not benefit from the read-oriented storage that MonetDB/BAM uses. However, [2] shows many use cases in which this storage layout does pay off.

References

  1. Beerenwinkel N et al (2012) Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front Microbiol 3:329

  2. Cijvat R (2014) Bridging the gap between big genome data analysis and database management systems. Master’s thesis, CWI and Utrecht University

  3. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Paper presented at the 6th Symposium on Operating Systems Design and Implementation, San Francisco, December 2004

  4. Dorok S et al (2014) Toward Efficient Variant Calling Inside Main-Memory Database Systems. BIOKDD-DEXA Workshops, pp. 41–45

  5. Gire SK et al (2014) Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science 345(6202):1369–1372

  6. Kargin Y, Kersten ML, Manegold S, Pirk H (2015) The DBMS—your big data sommelier. Proceedings of IEEE International Conference on Data Engineering 2015 (ICDE 31)

  7. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760

  8. Li H et al (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079

  9. Manegold S et al (2009) Database architecture evolution: mammals flourished long before dinosaurs became extinct. PVLDB 2(2):1648–1653

  10. Pavlo A et al (2009) A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, pp. 165–178

  11. Quinlan A, Hall I (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842

  12. Röhm U, Blakeley JA (2009) Data management for high-throughput genomics. CIDR, pp. 97–111

  13. Schapranow MP, Plattner H (2013) HIG - An in-memory database platform enabling real-time analyses of genome data. BigData, pp. 691–696

  14. Schatz MC, Langmead B (2013) The DNA data deluge. IEEE Spectrum 50(7):28–33

  15. Toepfer A et al (2014) Viral quasispecies assembly via maximal clique enumeration. PLoS Comput Biol 10(3):e1003515

  16. Volchkov VE et al (1999) Characterization of the L gene and 5′ trailer region of Ebola virus. J Gen Virol 80(Pt 2):355–362

  17. Wolstencroft K et al (2013) The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res 41(Web Server issue):W557–W561

Author information

Corresponding author

Correspondence to Robin Cijvat.

About this article

Cite this article

Cijvat, R., Manegold, S., Kersten, M. et al. Genome sequence analysis with MonetDB. Datenbank Spektrum 15, 185–191 (2015). https://doi.org/10.1007/s13222-015-0198-x

  • DOI: https://doi.org/10.1007/s13222-015-0198-x
