Counting Kmers for Biological Sequences at Large Scale

Ge, Jianqiu; Meng, Jintao; Guo, Ning; Wei, Yanjie; Balaji, Pavan; Feng, Shengzhong

doi:10.1007/s12539-019-00348-5

Original research article
Published: 16 November 2019

Volume 12, pages 99–108, (2020)

Interdisciplinary Sciences: Computational Life Sciences

Jianqiu Ge¹, Jintao Meng¹, Ning Guo¹, Yanjie Wei¹, Pavan Balaji² & Shengzhong Feng¹

Abstract

Counting the abundance of all distinct kmers in biological sequence data is a fundamental step in bioinformatics, with applications that include de novo genome assembly and error correction. With the development of sequencing technology, the sequence data in a single project can reach the terabyte or even petabyte scale. Counting kmers in data sets of this size exceeds the memory and computing capacity of a single computing node, and processing them efficiently on a high-performance computing cluster is a challenge. We therefore propose SWAPCounter, a highly scalable distributed approach for kmer counting. It embeds an MPI streaming I/O module for loading huge data sets at high speed and a counting Bloom filter module for both memory and communication efficiency. By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. The experimental results indicate that SWAPCounter is competitive with two other tools, KMC2 and MSPKmerCounter, in a shared-memory environment. Moreover, SWAPCounter shows the highest scalability in strong scaling experiments. In our experiment on the Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2048 cores as the baseline) when processing 4 TB of sequence data from the 1000 Genomes project. The source code of SWAPCounter is publicly available at https://github.com/mengjintao/SWAPCounter.
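
To make the idea of kmer counting with a counting Bloom filter concrete, the following is a minimal single-process sketch in Python. It is not the SWAPCounter implementation: the names `kmers` and `CountingBloomFilter`, the table size, the number of hash functions, and the choice of BLAKE2b are all assumptions made for illustration, and the sketch ignores MPI, streaming I/O, and reverse complements.

```python
import hashlib


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountingBloomFilter:
    """Fixed-size array of counters; each item maps to num_hashes slots.

    Illustrative only: size and num_hashes are arbitrary assumptions.
    """

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _slots(self, item):
        # Derive num_hashes slot indices by salting one BLAKE2b hash.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        """Increment every counter the item maps to."""
        for slot in self._slots(item):
            self.counters[slot] += 1

    def estimate(self, item):
        """Return the minimum counter: never below the true count,
        but possibly above it when other items share the same slots."""
        return min(self.counters[slot] for slot in self._slots(item))


cbf = CountingBloomFilter()
for kmer in kmers("ACGTACGTAC", 4):
    cbf.add(kmer)
print(cbf.estimate("ACGT"))  # at least 2, the true count of ACGT
```

The memory footprint is fixed by `size` rather than by the number of distinct kmers, which is the property that makes this kind of structure attractive when the exact table would not fit in memory. The price is that counts are upper-bound estimates rather than exact values.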

Acknowledgements

This work is supported by the National Key Research and Development Program of China under Grant nos. 2016YFB0201305 and 2018YFB0204403; the National Science Foundation of China under Grant nos. U1435215 and 61433012; the Shenzhen Basic Research Fund under Grant nos. JCYJ20160331190123578, JCYJ20170413093358429, and GGFW2017073114031767; and the Chinese Academy of Sciences under Grant no. 2019VBA0009. We would also like to acknowledge the funding support of the Shenzhen Discipline Construction Project for Urban Computing and Data Intelligence, Youth Innovation Promotion Association, and CAS to Yanjie Wei.

Author information

Jianqiu Ge and Jintao Meng contributed equally.

Authors and Affiliations

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
Jianqiu Ge, Jintao Meng, Ning Guo, Yanjie Wei & Shengzhong Feng
Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, 60439-4844, USA
Pavan Balaji

Corresponding author

Correspondence to Yanjie Wei.

About this article

Cite this article

Ge, J., Meng, J., Guo, N. et al. Counting Kmers for Biological Sequences at Large Scale. Interdiscip Sci Comput Life Sci 12, 99–108 (2020). https://doi.org/10.1007/s12539-019-00348-5

Received: 08 July 2019
Revised: 19 August 2019
Accepted: 25 October 2019
Published: 16 November 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s12539-019-00348-5
