Scheduling and Execution of Genome Data Processing Pipelines

Part of the book series: In-Memory Data Management Research (IMDM)

Abstract

Genome data processing pipelines extract a list of mutated genes from raw genome data, which is essential information that physicians and researchers require from a patient’s genome. Improving the performance, stability and flexibility of pipeline execution environments is therefore important. Today, analyzing a sequenced human genome takes up to several weeks. In scientific environments, parallelization using computer clusters and in-memory technology can accelerate execution to several hours, but flexibility and scheduling need to be improved, too. In this work, I propose a way to execute diverse genome data processing pipelines on a cluster of worker machines coordinated by a single scheduler using an in-memory database. I will show how a scheduling algorithm, which estimates processing time by analyzing execution logs, improves the system’s throughput. I will also discuss how the database can be used as communication medium, log, decision instance and statistics service and how the system can benefit from its power.
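
The idea of estimating processing time from execution logs can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the chapter's actual algorithm: the log fields (`task`, `input_size`, `seconds`), the per-unit averaging, the task names, and the shortest-estimated-job-first ordering are all hypothetical, and the abstract does not say which scheduling policy is used.

```python
from collections import defaultdict
from statistics import mean


def estimate_rates(log):
    """Average seconds per unit of input for each task type in the log."""
    samples = defaultdict(list)
    for entry in log:
        samples[entry["task"]].append(entry["seconds"] / entry["input_size"])
    return {task: mean(rates) for task, rates in samples.items()}


def estimate_seconds(job, rates, default_rate=1.0):
    """Estimated runtime of a pending job; unseen task types use default_rate."""
    return rates.get(job["task"], default_rate) * job["input_size"]


def schedule(jobs, rates):
    """Order pending jobs by estimated runtime, shortest first (assumed policy)."""
    return sorted(jobs, key=lambda job: estimate_seconds(job, rates))


# Hypothetical execution log: one entry per finished pipeline step.
execution_log = [
    {"task": "alignment", "input_size": 10, "seconds": 100.0},
    {"task": "alignment", "input_size": 20, "seconds": 220.0},
    {"task": "variant_calling", "input_size": 10, "seconds": 30.0},
]

# Hypothetical jobs waiting for a free worker.
pending = [
    {"id": "a", "task": "alignment", "input_size": 5},
    {"id": "b", "task": "variant_calling", "input_size": 10},
    {"id": "c", "task": "alignment", "input_size": 1},
]

rates = estimate_rates(execution_log)
order = [job["id"] for job in schedule(pending, rates)]
print(order)
```

In a setup like the one the abstract describes, the log would presumably live in the in-memory database and the averages would be computed by a query rather than in application code; the sketch keeps both in plain Python to stay self-contained.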

Author information

Corresponding author

Correspondence to Cornelius Bock.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Bock, C. (2014). Scheduling and Execution of Genome Data Processing Pipelines. In: Plattner, H., Schapranow, MP. (eds) High-Performance In-Memory Genome Data Analysis. In-Memory Data Management Research. Springer, Cham. https://doi.org/10.1007/978-3-319-03035-7_3
