Performance Analysis of a Parallel, Multi-node Pipeline for DNA Sequencing

  • Dries Decap
  • Joke Reumers
  • Charlotte Herzeel
  • Pascal Costanza
  • Jan Fostier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9574)

Abstract

Post-sequencing DNA analysis typically consists of read mapping followed by variant calling and is very time-consuming, even on a multi-core machine. Recently, we proposed Halvade, a parallel, multi-node implementation of a DNA sequencing pipeline that follows the GATK Best Practices recommendations. The MapReduce programming model is used to distribute the workload among different workers. In this paper, we study the impact of different hardware configurations on the performance of Halvade. Benchmarks indicate that the lack of good multithreading capabilities in the existing tools (BWA, SAMtools, Picard, GATK), in particular, causes suboptimal scaling behavior. We demonstrate that it is possible to circumvent this bottleneck by using multiprocessing on high-memory machines rather than multithreading. Using a 15-node cluster with 360 CPU cores in total, this results in a runtime of 1 h 31 min. Compared to a single-threaded runtime of ~12 days, this corresponds to an overall parallel efficiency of 53%.
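The efficiency figure quoted above can be checked with a short calculation. The following is a minimal sketch, assuming the usual definition of parallel efficiency (speedup divided by the number of cores) and taking the approximate single-threaded runtime as exactly 12 days; the function and constant names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract. The single-threaded runtime is only
# given as "~12 days", so treating it as exactly 12 days is an assumption.
SINGLE_THREADED_HOURS = 12 * 24      # ~12 days, in hours
PARALLEL_HOURS = 1 + 31 / 60         # 1 h 31 min, in hours
CPU_CORES = 360                      # 15 nodes, 360 cores in total


def parallel_efficiency(serial_time, parallel_time, cores):
    """Return speedup / cores, the standard parallel-efficiency ratio."""
    speedup = serial_time / parallel_time
    return speedup / cores


speedup = SINGLE_THREADED_HOURS / PARALLEL_HOURS
efficiency = parallel_efficiency(SINGLE_THREADED_HOURS, PARALLEL_HOURS, CPU_CORES)

print(f"speedup ~ {speedup:.0f}x, efficiency ~ {efficiency:.0%}")
# prints "speedup ~ 190x, efficiency ~ 53%"
```

Under these assumptions, the speedup is roughly 190 on 360 cores, which is consistent with the 53% efficiency stated in the abstract.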

Keywords

DNA sequencing, MapReduce, Hadoop, Cloudera, Distributed file systems

Notes

Acknowledgments

This work is funded by Intel, Janssen Pharmaceutica and the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT). The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, the Hercules Foundation and the Flemish Government department EWI. Special thanks go to Stijn De Weirdt for his assistance with the Java wrappers to improve NUMA locality. Benchmarks on Lustre were run at the Intel Big Data Lab, Swindon, UK. We acknowledge the support of Ghent University (Multidisciplinary Research Partnership Bioinformatics: From Nucleotides to Networks).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Dries Decap (1, 5)
  • Joke Reumers (2, 5)
  • Charlotte Herzeel (3, 5)
  • Pascal Costanza (4, 5)
  • Jan Fostier (1, 5)

  1. Department of Information Technology, Ghent University - iMinds, Ghent, Belgium
  2. Janssen Research & Development, A Division of Janssen Pharmaceutica N.V., Beerse, Belgium
  3. Imec, Leuven, Belgium
  4. Intel Corporation, Brussels, Belgium
  5. ExaScience Life Lab, Leuven, Belgium
