Scheduling and Execution of Genome Data Processing Pipelines

Bock, Cornelius

doi:10.1007/978-3-319-03035-7_3

Chapter
First Online: 19 November 2013

Part of the book series: In-Memory Data Management Research (IMDM)

Abstract

Genome data processing pipelines extract a list of mutated genes from raw genome data, which is essential information that physicians and researchers require from a patient’s genome. Improving the performance, stability and flexibility of pipeline execution environments is therefore important. Today, analyzing a sequenced human genome takes up to several weeks. In scientific environments, parallelization using computer clusters and in-memory technology can reduce execution time to several hours, but flexibility and scheduling need to be improved, too. In this work, I propose a way to execute diverse genome data processing pipelines on a cluster of worker machines coordinated by a single scheduler using an in-memory database. I will show how a scheduling algorithm, which estimates processing time by analyzing execution logs, improves the system’s throughput. I will also discuss how the database can be used as a communication medium, log, decision instance and statistics service, and how the system can benefit from its power.
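
The idea of estimating processing time from execution logs can be illustrated with a minimal Python sketch. It is not the chapter's algorithm: the log layout, the task names, the per-gigabyte rate model, the fallback rate and the shortest-estimate-first ordering are all assumptions made for illustration, and the log is a plain list rather than a table in an in-memory database.

```python
from collections import defaultdict

# Hypothetical execution log: (task_type, input_size_gb, runtime_seconds).
# The real system would read such records from its database; this list is
# an illustrative stand-in.
EXECUTION_LOG = [
    ("alignment", 10.0, 1200.0),
    ("alignment", 20.0, 2400.0),
    ("variant_calling", 10.0, 300.0),
    ("variant_calling", 30.0, 900.0),
]


def seconds_per_gb(log):
    """Return the average observed runtime per GB of input for each task type."""
    total_time = defaultdict(float)
    total_size = defaultdict(float)
    for task_type, size_gb, runtime in log:
        total_time[task_type] += runtime
        total_size[task_type] += size_gb
    return {t: total_time[t] / total_size[t] for t in total_time}


def estimate_runtime(rates, task_type, size_gb, default_rate=100.0):
    """Estimate a task's runtime; unseen task types use an assumed default rate."""
    return rates.get(task_type, default_rate) * size_gb


def order_by_estimate(pending, rates):
    """Order (task_id, task_type, size_gb) tuples by estimated runtime, shortest first."""
    return sorted(pending, key=lambda task: estimate_runtime(rates, task[1], task[2]))


rates = seconds_per_gb(EXECUTION_LOG)
pending = [
    ("a", "alignment", 5.0),
    ("b", "variant_calling", 10.0),
    ("c", "alignment", 1.0),
]
print([task_id for task_id, _, _ in order_by_estimate(pending, rates)])
```

Running short tasks first is only one possible use of such estimates; a scheduler could equally use them to balance load across workers or to prioritize by deadline.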

Author information

Authors and Affiliations

Potsdam, Germany
Cornelius Bock

Corresponding author

Correspondence to Cornelius Bock.

Editor information

Editors and Affiliations

Enterprise Platform and Integration Concepts, Hasso Plattner Institute, Potsdam, Germany
Hasso Plattner
Enterprise Platform and Integration Concepts Chair, Hasso Plattner Institute, Potsdam, Germany
Matthieu-P. Schapranow

About this chapter

Cite this chapter

Bock, C. (2014). Scheduling and Execution of Genome Data Processing Pipelines. In: Plattner, H., Schapranow, MP. (eds) High-Performance In-Memory Genome Data Analysis. In-Memory Data Management Research. Springer, Cham. https://doi.org/10.1007/978-3-319-03035-7_3

DOI: https://doi.org/10.1007/978-3-319-03035-7_3
Published: 19 November 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-03034-0
Online ISBN: 978-3-319-03035-7
eBook Packages: Biomedical and Life Sciences (R0)
