Counting Kmers for Biological Sequences at Large Scale

  • Jianqiu Ge
  • Jintao Meng
  • Ning Guo
  • Yanjie Wei (corresponding author)
  • Pavan Balaji
  • Shengzhong Feng
Original research article

Abstract

Counting the abundance of all distinct kmers in biological sequence data is a fundamental step in bioinformatics applications such as de novo genome assembly and error correction. With the development of sequencing technology, the sequence data in a single project can reach terabyte or even petabyte scale. Counting kmer abundance in such data exceeds the memory and computing capacity of a single computing node, and processing it efficiently on a high-performance computing cluster remains a challenge. We therefore propose SWAPCounter, a highly scalable distributed approach for kmer counting. The approach incorporates an MPI streaming I/O module that loads huge data sets at high speed and a counting bloom filter module that improves both memory and communication efficiency. By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. The experimental results indicate that SWAPCounter is competitive with two other tools, KMC2 and MSPKmerCounter, in a shared-memory environment. Moreover, SWAPCounter shows the highest scalability in strong-scaling experiments. In our experiment on the Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2048 cores as the baseline) when processing 4 TB of sequence data from the 1000 Genomes Project. The source code of SWAPCounter is publicly available at https://github.com/mengjintao/SWAPCounter.
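To make the two central ideas of the abstract concrete, the following is a minimal single-process sketch of kmer counting with a counting bloom filter. It is not SWAPCounter's implementation: the class and function names, the counter array size, the number of hash functions, and the use of salted BLAKE2b hashes are all assumptions chosen for illustration, and the MPI streaming I/O and distribution across nodes are omitted entirely.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring (kmer) of seq, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountingBloomFilter:
    """Illustrative counting bloom filter (hypothetical helper, not from the paper).

    An array of one-byte saturating counters addressed by several hash
    functions. Estimates can overcount because of hash collisions, but they
    never undercount unless a counter has saturated at max_count.
    """

    def __init__(self, num_counters=1 << 16, num_hashes=3, max_count=255):
        self.counters = bytearray(num_counters)
        self.num_hashes = num_hashes
        self.max_count = max_count

    def _slots(self, item):
        # Derive num_hashes slot indices by salting BLAKE2b with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % len(self.counters)

    def add(self, item):
        # Increment every counter the item maps to, saturating at max_count.
        for slot in self._slots(item):
            if self.counters[slot] < self.max_count:
                self.counters[slot] += 1

    def estimate(self, item):
        # The smallest of the item's counters is the least-inflated estimate.
        return min(self.counters[slot] for slot in self._slots(item))


def count_reads(reads, k):
    """Insert every kmer of every read into a fresh counting bloom filter."""
    cbf = CountingBloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            cbf.add(kmer)
    return cbf


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACG"]  # toy input, assumed for illustration
    k = 4
    cbf = count_reads(reads, k)
    exact = Counter(kmer for read in reads for kmer in kmers(read, k))
    for kmer in sorted(exact):
        print(kmer, "exact:", exact[kmer], "estimate:", cbf.estimate(kmer))
```

The exact `Counter` is included only to compare against the filter's estimates on a toy input. The filter's memory is fixed by `num_counters` regardless of how many distinct kmers are seen, which is the general property that makes this family of structures attractive when the kmer set is too large to store exactly.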

Keywords

  • Kmer counting
  • Biological sequence
  • Counting bloom filter
  • Scalability

Notes

Acknowledgements

This work is supported by the National Key Research and Development Program of China under Grant nos. 2016YFB0201305 and 2018YFB0204403; the National Science Foundation of China under Grant nos. U1435215 and 61433012; the Shenzhen Basic Research Fund under Grant nos. JCYJ20160331190123578, JCYJ20170413093358429, and GGFW2017073114031767; and the Chinese Academy of Sciences under Grant no. 2019VBA0009. We also gratefully acknowledge funding support from the Shenzhen Discipline Construction Project for Urban Computing and Data Intelligence, Youth Innovation Promotion Association, and CAS to Yanjie Wei.

Copyright information

© International Association of Scientists in the Interdisciplinary Areas 2019

Authors and Affiliations

  • Jianqiu Ge (1)
  • Jintao Meng (1)
  • Ning Guo (1)
  • Yanjie Wei (1, corresponding author)
  • Pavan Balaji (2)
  • Shengzhong Feng (1)

  1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, China
  2. Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, USA
