Cluster Computing, Volume 20, Issue 3, pp 1869–1880

Optimization of SAMtools sorting using OpenMP tasks

Abstract

SAMtools is a widely used genomics application for post-processing high-throughput sequence alignment data. Such sequence alignment data are commonly sorted to make downstream analysis more efficient. However, this sorting process itself can be computationally and I/O intensive: high-throughput sequence alignment files in the de facto standard binary alignment/map (BAM) format can be many gigabytes in size, and may need to be decompressed before sorting and compressed afterwards. As a result, BAM-file sorting can be a bottleneck in genomics workflows. This paper describes a case study on the performance analysis and optimization of SAMtools for sorting large BAM files. OpenMP task parallelism and memory optimization techniques resulted in a speedup of 5.9X versus the upstream SAMtools 1.3.1 for an internal (in-memory) sort of 24.6 GiB of compressed BAM data (102.6 GiB uncompressed) with 32 processor cores, while a 1.98X speedup was achieved for an external (out-of-core) sort of a 271.4 GiB BAM file.
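
As a rough illustration of what OpenMP task parallelism can look like for an in-memory sort, the sketch below implements a recursive merge sort in which each half of a range is sorted in its own task. It is not the SAMtools code and does not reproduce the paper's optimizations: it sorts plain `long` keys rather than BAM records, and the names `parallel_sort`, `sort_range`, `merge_runs`, and `TASK_CUTOFF` are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cutoff: ranges at or below this size are sorted serially
 * with qsort() so that task-creation overhead stays small. */
#define TASK_CUTOFF 4096

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Merge the sorted runs a[lo..mid) and a[mid..hi) via the scratch buffer. */
static void merge_runs(long *a, long *tmp, size_t lo, size_t mid, size_t hi)
{
    assert(lo <= mid && mid <= hi);
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Sort a[lo..hi): each half becomes an OpenMP task; the parent waits for
 * both children before merging. The pointer and index arguments are
 * firstprivate in the tasks by default, which is what we want here. */
static void sort_range(long *a, long *tmp, size_t lo, size_t hi)
{
    if (hi - lo <= TASK_CUTOFF) {
        qsort(a + lo, hi - lo, sizeof *a, cmp_long);
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    #pragma omp task
    sort_range(a, tmp, lo, mid);
    #pragma omp task
    sort_range(a, tmp, mid, hi);
    #pragma omp taskwait
    merge_runs(a, tmp, lo, mid, hi);
}

/* Entry point: returns 0 on success, -1 if the scratch buffer cannot be
 * allocated. One thread creates the root of the task tree; the other
 * threads in the team execute tasks as they become available. */
int parallel_sort(long *a, size_t n)
{
    if (n < 2)
        return 0;
    long *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return -1;
    #pragma omp parallel
    #pragma omp single
    sort_range(a, tmp, 0, n);
    free(tmp);
    return 0;
}
```

Calling `parallel_sort(keys, n)` sorts the array in place. When compiled with OpenMP enabled (for example `-fopenmp` with GCC), the two recursive calls can run concurrently; without it the pragmas are ignored and the same code runs serially. The serial cutoff is a tuning knob, and the value used here is an arbitrary placeholder rather than a measured optimum.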

Keywords

Bioinformatics, High-throughput sequencing, OpenMP, Sorting, Burst buffer

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Department of Computer Science, Iowa State University, Ames, USA
  2. Department of Mathematics, Iowa State University, Ames, USA
