Abstract
Genome data processing pipelines extract a list of mutated genes from raw genome data; this list is essential information that physicians and researchers require from a patient’s genome. Improving the performance, stability, and flexibility of pipeline execution environments is therefore important. Today, analysing a sequenced human genome takes up to several weeks. In scientific environments, parallelization using computer clusters and in-memory technology can reduce execution time to several hours, but flexibility and scheduling need to be improved, too. In this work, I propose a way to execute diverse genome data processing pipelines on a cluster of worker machines coordinated by a single scheduler using an in-memory database. I will show how a scheduling algorithm, which estimates processing time by analyzing execution logs, improves the system’s throughput. I will also discuss how the database can be used as a communication medium, log, decision instance, and statistics service, and how the system can benefit from its power.
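The idea of estimating processing time from execution logs can be pictured with a small sketch. It is an illustration only and not the chapter's algorithm: it assumes a log of (task type, runtime) pairs, averages the runtimes per task type, and orders pending tasks shortest-estimate-first. The log format, the task names, and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def estimate_durations(log):
    """Average the logged runtime (in seconds) per task type."""
    samples = defaultdict(list)
    for task_type, seconds in log:
        samples[task_type].append(seconds)
    return {task_type: mean(values) for task_type, values in samples.items()}


def order_by_estimate(pending, estimates, default=float("inf")):
    """Sort pending tasks so the shortest estimated task comes first.

    Task types with no log history get `default` and are placed last.
    """
    return sorted(pending, key=lambda task: estimates.get(task, default))


# Hypothetical log entries: (task type, observed runtime in seconds).
log = [("alignment", 3600), ("alignment", 4000), ("variant_calling", 900)]
estimates = estimate_durations(log)
queue = order_by_estimate(["alignment", "variant_calling"], estimates)
```

Running short tasks first is only one possible policy; it is used here because it makes the effect of the estimates on ordering easy to see.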
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Bock, C. (2014). Scheduling and Execution of Genome Data Processing Pipelines. In: Plattner, H., Schapranow, MP. (eds) High-Performance In-Memory Genome Data Analysis. In-Memory Data Management Research. Springer, Cham. https://doi.org/10.1007/978-3-319-03035-7_3
DOI: https://doi.org/10.1007/978-3-319-03035-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-03034-0
Online ISBN: 978-3-319-03035-7
eBook Packages: Biomedical and Life Sciences, Biomedical and Life Sciences (R0)