Performance Analysis and Optimization of SAMtools Sorting
SAMtools is a suite of tools widely used in genomics workflows for post-processing sequence alignment data from large high-throughput sequencing data sets. A common use of SAMtools is to sort files in the standard Binary Alignment/Map (BAM) format emitted by many sequence aligners. Sorting can be both computationally intensive and I/O-intensive: BAM files can be many gigabytes in size and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper presents a case study on the performance characterization and optimization of BAM sorting with SAMtools. OpenMP task parallelism was used to increase concurrency, and memory optimization techniques were applied, in both SAMtools and the underlying library HTSlib. With all 32 processor cores of the benchmark system in use, the optimizations yielded a speedup of 3.92X for an in-memory sort of 24.6 GiB of BAM data (102.6 GiB uncompressed) and a speedup of 1.55X for an out-of-core sort.
Keywords: Bioinformatics, High-throughput sequencing, OpenMP
The authors thank Marina Kraeva for her careful proofreading of the manuscript.
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.