Analysis method and algorithm design of biological sequence problem based on generalized k-mer vector

Liu, Wen-li; Wu, Qing-biao

doi:10.1007/s11766-021-4033-x

Open access
Published: 10 March 2021

Volume 36, pages 114–127, (2021)

Applied Mathematics-A Journal of Chinese Universities

Wen-li Liu¹,² & Qing-biao Wu¹

256 Accesses
2 Citations

A Correction to this article was published on 22 June 2021

This article has been updated

Abstract

K-mers can be used to describe biological sequences, and the k-mer distribution is a tool for solving sequence analysis problems in bioinformatics. The k-mer vector serves as a representation of the k-mer distribution of a biological sequence. Problems such as similarity calculation and sequence assembly can be described in the k-mer vector space. This helps to identify new features of long-standing sequence-based problems in bioinformatics and to develop new algorithms using concepts and methods from linear space theory. In this study, we define the k-mer vector space for generalized biological sequences and explain the meaning of the corresponding vector operations in the biological context. We present the vector/matrix form of several common sequence-based problems, including read quantification, sequence assembly, and pattern detection, and discuss the advantages and disadvantages of this formulation. We also implement a tool for the sequence assembly problem based on the k-mer vector method, which shows the practicability and convenience of this algorithm design strategy.
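
To make the idea of a k-mer vector concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the plain DNA alphabet ACGT, a lexicographic ordering of k-mers as vector coordinates, and cosine similarity as one possible similarity measure; the function names (kmer_index, kmer_vector, add_vectors, cosine_similarity) are hypothetical and chosen only for illustration. The generalized definition in the paper may differ.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def kmer_index(k, alphabet=ALPHABET):
    """Assign each possible k-mer a fixed coordinate (lexicographic order)."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_vector(seq, k, alphabet=ALPHABET):
    """Return the k-mer count vector of seq, of length len(alphabet) ** k.

    Windows containing symbols outside the alphabet are skipped.
    """
    index = kmer_index(k, alphabet)
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:
            vec[pos] += 1
    return vec


def add_vectors(u, v):
    """Coordinate-wise sum: the k-mer content of two sequences pooled together."""
    return [a + b for a, b in zip(u, v)]


def cosine_similarity(u, v):
    """Cosine of the angle between two k-mer vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    read_a = kmer_vector("ACGTACGT", 3)
    read_b = kmer_vector("ACGTTCGT", 3)
    print(cosine_similarity(read_a, read_b))
    print(sum(add_vectors(read_a, read_b)))  # total k-mers in both reads
```

In this sketch, vector addition corresponds to pooling the k-mers of several reads, which is the kind of operation that lets problems such as read quantification be written in vector/matrix form.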

Change history

22 June 2021
An Erratum to this paper has been published: https://doi.org/10.1007/s11766-021-4472-4

Author information

Authors and Affiliations

Department of Mathematics, Zhejiang University, Hangzhou, 310027, China
Wen-li Liu & Qing-biao Wu
Zhejiang Provincial Key Laboratory of Horticultural Plant Integrative Biology, Zhejiang University, Zijingang Campus, Hangzhou, 310012, China
Wen-li Liu

Corresponding author

Correspondence to Qing-biao Wu.

Additional information

Supported by the National Natural Science Foundation of China (11771393, 11632015) and the Natural Science Foundation of Zhejiang Province, China (LZ14A010002).

The original version of this article was revised due to a retrospective Open Access order.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Liu, Wl., Wu, Qb. Analysis method and algorithm design of biological sequence problem based on generalized k-mer vector. Appl. Math. J. Chin. Univ. 36, 114–127 (2021). https://doi.org/10.1007/s11766-021-4033-x

Received: 05 February 2020
Revised: 03 April 2020
Published: 10 March 2021
Issue Date: March 2021
DOI: https://doi.org/10.1007/s11766-021-4033-x

MR Subject Classification

92B05

Keywords
