Performance Analysis and Optimization of SAMtools Sorting

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10104)

Abstract

SAMtools is a suite of tools widely used in genomics workflows for post-processing sequence alignment data from large high-throughput sequencing data sets. A common use of SAMtools is to sort files in the standard Binary Alignment/Map (BAM) format emitted by many sequence aligners. Sorting can be both computationally intensive and I/O-intensive: BAM files can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper presents a case study on the performance characterization and optimization of BAM sorting with SAMtools. Two kinds of optimization were applied in both SAMtools and the underlying library HTSlib: OpenMP task parallelism to enhance concurrency, and memory optimization techniques. With all 32 processor cores of the benchmark system in use, the optimizations yielded a speedup of 3.92X for an in-memory sort of 24.6 GiB of BAM data (102.6 GiB uncompressed) and a speedup of 1.55X for an out-of-core sort.
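
As a rough illustration of why sorting alignment records lends itself to concurrency, the following Python sketch sorts independent chunks concurrently and then combines them with a k-way merge. It is not the authors' implementation, which uses OpenMP tasks in C inside SAMtools and HTSlib. The function name `parallel_chunk_sort`, the toy record layout, and the chunk size are all hypothetical and chosen only for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_chunk_sort(records, key, chunk_size, workers=4):
    """Sort `records` by sorting fixed-size chunks concurrently, then merging.

    Hypothetical helper for illustration; not part of SAMtools or HTSlib.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Each chunk is an independent unit of work, loosely analogous to one
    # task handed to a task-parallel runtime.
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(lambda c: sorted(c, key=key), chunks))
    # k-way merge of the sorted runs; heapq.merge is lazy, so materialize it.
    return list(heapq.merge(*sorted_chunks, key=key))


# Toy "alignment records" (assumed layout): (reference_id, position, read_name).
reads = [(1, 500, "r1"), (0, 120, "r2"), (1, 20, "r3"), (0, 7, "r4"), (2, 3, "r5")]
by_coordinate = parallel_chunk_sort(reads, key=lambda r: (r[0], r[1]), chunk_size=2)
print(by_coordinate)
```

In CPython the global interpreter lock keeps these threads from speeding up a pure-Python sort, so the sketch shows only the structure of the decomposition (independent sorted runs followed by a merge), not a performance result.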

Keywords

Bioinformatics, High-throughput sequencing, OpenMP

Notes

Acknowledgment

The authors thank Marina Kraeva for her careful proofreading of the manuscript.

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computer Science, Iowa State University, Ames, USA
  2. Department of Mathematics, Iowa State University, Ames, USA
