Abstract
Counting the abundance of all distinct kmers in biological sequence data is a fundamental step in bioinformatics, with applications that include de novo genome assembly and error correction. With the development of sequencing technology, the sequence data in a single project can reach terabytes or even petabytes of nucleotides. Counting kmer abundance in data of this size exceeds the memory and computing capacity of a single computing node, and processing it efficiently on a high-performance computing cluster remains a challenge. We therefore propose SWAPCounter, a highly scalable distributed approach for kmer counting. The approach embeds an MPI streaming I/O module for loading huge data sets at high speed and a counting Bloom filter module for both memory and communication efficiency. By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. The experimental results indicate that SWAPCounter performs competitively with two other tools, KMC2 and MSPKmerCounter, in a shared-memory environment. SWAPCounter also shows the highest scalability in strong scaling experiments. In our experiment on the Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2048 cores as the baseline) when processing 4 TB of sequence data from 1000 Genomes. The source code of SWAPCounter is publicly available at https://github.com/mengjintao/SWAPCounter.
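To make the two central ideas concrete, the following is a minimal single-process sketch of kmer counting alongside a counting Bloom filter. It is not SWAPCounter's implementation: the names `kmers`, `CountingBloomFilter`, and `count_kmers` are hypothetical, the filter size and number of hash functions are arbitrary, and the MPI streaming I/O and distributed steps described above are omitted entirely.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring (kmer) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountingBloomFilter:
    """Approximate counter: estimates may overcount but never undercount."""

    def __init__(self, size=1 << 16, num_hashes=3):
        # size and num_hashes are illustrative defaults, not tuned values.
        self.size = size
        self.num_hashes = num_hashes
        self.slots = [0] * size

    def _indexes(self, item):
        # Derive num_hashes slot indexes by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.slots[idx] += 1

    def estimate(self, item):
        # The smallest touched slot is the tightest upper bound on the count.
        return min(self.slots[idx] for idx in self._indexes(item))


def count_kmers(reads, k):
    """Return exact kmer counts and a counting Bloom filter over the same kmers."""
    exact = Counter()
    cbf = CountingBloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            exact[kmer] += 1
            cbf.add(kmer)
    return exact, cbf
```

Calling `count_kmers(["ACGTACGT", "CGTA"], 4)` returns an exact `Counter` together with a filter whose `estimate` is always at least the exact count. The filter uses a fixed amount of memory regardless of how many distinct kmers are seen, which is the property that makes this kind of structure attractive when exact tables do not fit in memory.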
Acknowledgements
This work was supported by the National Key Research and Development Program of China under Grant nos. 2016YFB0201305 and 2018YFB0204403; the National Science Foundation of China under Grant nos. U1435215 and 61433012; the Shenzhen Basic Research Fund under Grant nos. JCYJ20160331190123578, JCYJ20170413093358429, and GGFW2017073114031767; and the Chinese Academy of Sciences under Grant no. 2019VBA0009. We also thank the Shenzhen Discipline Construction Project for Urban Computing and Data Intelligence, the Youth Innovation Promotion Association, and CAS for funding support to Yanjie Wei.
Cite this article
Ge, J., Meng, J., Guo, N. et al. Counting Kmers for Biological Sequences at Large Scale. Interdiscip Sci Comput Life Sci 12, 99–108 (2020). https://doi.org/10.1007/s12539-019-00348-5
DOI: https://doi.org/10.1007/s12539-019-00348-5