Abstract
SAMtools is a widely used genomics application for post-processing high-throughput sequence alignment data. Such data are commonly sorted to make downstream analysis more efficient. However, the sorting process itself can be computationally and I/O intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and recompressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.
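The distinction between an internal (in-memory) and an external (out-of-core) sort can be illustrated with a minimal Python sketch. This is not the SAMtools implementation and it omits BAM decompression, compression, and any parallelism. The function name `external_sort` and the parameter `max_in_memory` are hypothetical, chosen only for illustration. The sketch assumes records are picklable and comparable: when all records fit within the limit they are sorted in memory, otherwise sorted runs are spilled to temporary files and combined with a k-way merge.

```python
import heapq
import pickle
import tempfile


def external_sort(records, max_in_memory, key=None):
    """Yield records in sorted order (illustrative sketch, hypothetical API).

    At most `max_in_memory` records are buffered before a sorted run is
    spilled to a temporary file. If no run is ever spilled, this reduces
    to an ordinary internal (in-memory) sort.
    """
    runs = []   # temporary files, each holding one sorted run
    chunk = []  # records currently buffered in memory

    def spill(buffered):
        # Sort the buffered records and write them out as one run.
        buffered.sort(key=key)
        run_file = tempfile.TemporaryFile()
        for rec in buffered:
            pickle.dump(rec, run_file)
        run_file.seek(0)
        runs.append(run_file)

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= max_in_memory:
            spill(chunk)
            chunk = []

    if not runs:
        # Internal sort: everything fit within the memory limit.
        chunk.sort(key=key)
        yield from chunk
        return

    if chunk:
        spill(chunk)  # final, possibly short, run

    def read_run(run_file):
        # Stream one run back from disk, one record at a time.
        while True:
            try:
                yield pickle.load(run_file)
            except EOFError:
                break
        run_file.close()

    # External sort: k-way merge of the sorted runs.
    yield from heapq.merge(*(read_run(f) for f in runs), key=key)


if __name__ == "__main__":
    # Hypothetical (reference name, position) pairs standing in for alignments.
    alignments = [("chr2", 5), ("chr1", 9), ("chr1", 3), ("chr3", 1), ("chr2", 2)]
    for rec in external_sort(alignments, max_in_memory=2):
        print(rec)
```

In this sketch the external path does extra work that the internal path avoids: every record is written to disk and read back once before the merge. This is one reason the two cases can behave differently under optimization.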
Notes
Source code for the SAMtools optimizations is available at https://doi.org/10.5281/zenodo.262169, and for the HTSlib optimizations at https://doi.org/10.5281/zenodo.262161.
Note that taskyield is a no-op as of GCC 6.2.0.
This check could occur after task generation and before returning from the routine; however, the implementation did not consistently perform as well in practice, possibly due to an undetermined effect on task scheduling.
Acknowledgements
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Cite this article
Weeks, N.T., Luecke, G.R. Optimization of SAMtools sorting using OpenMP tasks. Cluster Comput 20, 1869–1880 (2017). https://doi.org/10.1007/s10586-017-0874-8
DOI: https://doi.org/10.1007/s10586-017-0874-8