Abstract
K-mers can be used to describe biological sequences, and the k-mer distribution is a tool for solving sequence analysis problems in bioinformatics. The k-mer vector can serve as a representation of the k-mer distribution of a biological sequence. Problems such as similarity calculation or sequence assembly can then be described in the k-mer vector space. This helps identify new features of established sequence-based problems in bioinformatics and develop new algorithms using concepts and methods from linear space theory. In this study, we define the k-mer vector space for generalized biological sequences and explain the meaning of the corresponding vector operations in the biological context. We present the vector/matrix form of several common sequence-based problems, including read quantification, sequence assembly, and pattern detection, and discuss the advantages and disadvantages of this formulation. We also implement a tool for the sequence assembly problem based on k-mer vector methods, which demonstrates the practicability and convenience of this algorithm design strategy.
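As a rough illustration of the idea of representing a sequence by its k-mer distribution and comparing sequences in that vector space, the following minimal sketch builds a plain k-mer count vector and computes a cosine similarity. The function names `kmer_vector` and `cosine_similarity`, the DNA alphabet, the lexicographic indexing, and the choice of cosine similarity are all assumptions made for this example; this is not the generalized k-mer vector defined in the paper, nor the authors' assembly tool.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k, alphabet="ACGT"):
    """Count vector of all k-mers over `alphabet`, in lexicographic order.

    Hypothetical helper: a plain count vector, not the paper's
    generalized k-mer vector.
    """
    # Slide a window of width k over the sequence and tally each k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # One coordinate per possible k-mer; k-mers that never occur count as 0.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    a = kmer_vector("ACGTACGT", 2)
    b = kmer_vector("ACGTTGCA", 2)
    print(cosine_similarity(a, b))
```

Because every sequence maps to a vector of the same length (4^k coordinates for DNA), sequences of different lengths can be compared directly, at the cost of discarding the positions at which the k-mers occur.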
Change history
22 June 2021
An Erratum to this paper has been published: https://doi.org/10.1007/s11766-021-4472-4
Additional information
Supported by the National Natural Science Foundation of China (11771393, 11632015) and the Natural Science Foundation of Zhejiang Province, China (LZ14A010002).
The original version of this article was revised due to a retrospective Open Access order.
Rights and permissions
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Liu, Wl., Wu, Qb. Analysis method and algorithm design of biological sequence problem based on generalized k-mer vector. Appl. Math. J. Chin. Univ. 36, 114–127 (2021). https://doi.org/10.1007/s11766-021-4033-x
DOI: https://doi.org/10.1007/s11766-021-4033-x