Analysis method and algorithm design of biological sequence problem based on generalized k-mer vector

K-mers can be used to describe biological sequences, and the k-mer distribution is a tool for solving sequence analysis problems in bioinformatics. We can use the k-mer vector as a representation of the k-mer distribution of a biological sequence. Problems such as similarity calculation or sequence assembly can be described in the k-mer vector space. This helps us identify new features of established sequence-based problems in bioinformatics and develop new algorithms using the concepts and methods of linear space theory. In this study, we define the k-mer vector space for generalized biological sequences and explain the meaning of the corresponding vector operations in the biological context. We present the vector/matrix form of several common sequence-based problems, including read quantification, sequence assembly, and pattern detection, and discuss the advantages and disadvantages of this form. We also implement a tool for the sequence assembly problem based on the k-mer vector method, which shows the practicability and convenience of this algorithm design strategy.


§1 Introduction
Sequence analysis is a fundamental problem in bioinformatics [1,2,3]. Developing algorithms for sequence analysis problems is an active topic because sequencing technology is evolving rapidly [4].
Besides the most common string representation of biological sequences, other sequence forms also exist, for example the graphical representation of sequences with geometric features [5,6]. For vector and matrix representations, researchers have reported several useful results. Ding et al. used the frequency of short k-words (length ≤ 7) to give a simple feature representation vector for DNA sequences [7]. Wren et al. built a matrix by extracting a set of position-dependent features to represent a DNA sequence [8]. Wen et al. used a Boolean value for the existence of each k-mer to build a matrix [9]. There are also works on protein sequences [10]. However, these methods are incompatible with each other, which limits the vector method of biological sequences when it is extended to new problems. A more general definition is required.
Biological sequence data and sequence-based problems have some features in common. The large size of the data sets increases the problem complexity. Massive information redundancy exists in many cases, and part of it is unnecessary. The sequence length, the read depth of coverage, and the sequencing accuracy are common variables in the analysis problems. The calculation of sequence similarity is a key node in the network of sequence-based problems. These bioinformatics features must be linked to the concepts of linear algebra and matrix theory when constructing the vector spaces. Then a myriad of methods and research findings from linear algebra and computational mathematics can be applied to biological sequence-based problems in their matrix form.
The k-mer is a frequently used concept in solving sequence-based problems in bioinformatics. K-mers are short subsequences of a biological sequence. Using short subsequences can reduce the complexity of a problem by considering only part of the information. Researchers have developed many analysis tools that use the k-mer as the key element of the algorithm design, such as kallisto [11] and Sailfish [12], with k-mer lengths ranging from a few bp (base pairs) to dozens of bp in different works. The read quantification tool kallisto in particular has a highly efficient alignment-free algorithm based on k-mer counts, showing the ability of k-mers in solving a sequence-based problem. There are also efficient methods for counting k-mers ([13], KMC2 [14] and Kmer-SSR [15]) and discussions on the suitable k-mer length [16,17]. These conditions make the k-mer a priority option for building solutions to sequence-based problems.
In this study, we define the k-mer vector for a DNA sequence and a sequence set. We construct a vector space of the sequence vectors and define the operations over it. We then present three basic sequence-based bioinformatics problems in their matrix form: read quantification, sequence assembly, and sequence pattern detection. Moreover, we design an algorithm and implement a tool for the sequence assembly problem using the k-mer vector method, as an instance of applying the concepts of sequence vector space and vector operations to a practical problem.

§2 The k-mer vector space for biological sequence
Biological sequences are sequences over an alphabet of limited size. In this work, we mainly deal with the DNA sequence (alphabet {A, G, C, T}) as a simplified model for discussion.

k-mer vector generated from a biological sequence
Let $s$ be a DNA sequence of length $l_s$, so $s$ is a string of length $l_s$ over the alphabet $\Sigma = \{A, G, C, T\}$. Let the k-mer set of $s$ with defined k-mer length $k$ be
$$K_s = \{k_i \mid i \in \{1, 2, \cdots, l_s - k + 1\}\},$$
where $k_i$ is the subsequence of $s$ starting from location $i$. We have
$$k_i = s[i]\, s[i+1] \cdots s[i+k-1],$$
where $s[j]$ denotes the $j$-th base of $s$. Then a set of sequences can be denoted as $S = \{s_i \mid i \in \{1, 2, 3, \cdots, n_S\}\}$, where $s_i$ stands for one sequence and $n_S$ denotes the number of sequences in set $S$.
The total number of all possible k-mers with defined k-mer length $k$ is $4^k$. If we sort all the k-mers in dictionary order, which is $(k^1, k^2, \cdots, k^{4^k})$, the sequence $s$ can be described by the vector
$$\vec v_s = (v_1, v_2, \cdots, v_{4^k}),$$
where $v_j$ is the number of positions $i$ with $k_i = k^j$. We call $\vec v_s$ the k-mer vector of $s$. Thus a set $S$ with $n_S$ sequences can be denoted by a matrix whose rows are the k-mer vectors of its sequences:
$$S = \begin{pmatrix} \vec v_{s_1} \\ \vec v_{s_2} \\ \vdots \\ \vec v_{s_{n_S}} \end{pmatrix}.$$
In many sequence-based bioinformatics problems, there is a concept of read depth of coverage over each position along the sequence. When we introduce depth into the definition of the k-mer vector, we have
$$v_j = \sum_{i:\, k_i = k^j} d'_i,$$
where $d'_i$ is the depth of coverage on position $i$ along sequence $s$ and $d'_i \in \mathbb{R}$. A negative value in the vector can be given a practical meaning as well: it can stand for the degree to which a position is unsupported. Thus the definition of the matrix for sequence set $S$ can be broadened to
$$S = (v_{ij}) \in \mathbb{R}^{n_S \times 4^k}.$$
Using the above definition, any sequence can be transformed into a k-mer vector. A k-mer vector can represent one read in a sequence set $S$ or the whole set $S$ by $[1, 1, \cdots, 1]\, S$, the sum of the rows of $S$. We can treat a single sequence as a special case of a sequence set of size 1, so the k-mer vector describes the k-mer set of a sequence set. Two identical sequences have the same k-mer vector, while the converse does not hold. In practice, however, the probability that two different sequences have the same k-mer vector is small.
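As a small illustration of the definition above, the following Python sketch builds a sparse k-mer vector from a DNA string. It is only a sketch under stated assumptions: the function names (`kmer_index`, `kmer_vector`, `to_dense`) are hypothetical, the bases are ranked as A < C < G < T for the dictionary order, and each vector element is taken to be a plain occurrence count (depth 1 at every position).

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed ranking of the bases; gives dictionary order


def kmer_index(kmer):
    """Return the 0-based rank of a k-mer among all 4^k k-mers."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + ALPHABET.index(base)
    return idx


def kmer_vector(seq, k):
    """Sparse k-mer vector of seq: {k-mer rank: number of occurrences}."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[kmer_index(seq[i:i + k])] += 1
    return dict(counts)


def to_dense(sparse, k):
    """Expand a sparse vector to the full 4^k-dimensional list."""
    dense = [0] * (4 ** k)
    for idx, value in sparse.items():
        dense[idx] = value
    return dense


v = kmer_vector("ACGTAC", 2)   # k-mers: AC, CG, GT, TA, AC
dense_v = to_dense(v, 2)       # 16 entries, most of them zero
```

Only the nonzero entries are stored, which reflects the sparsity of the vector for realistic k. The sketch also makes the non-injectivity concrete: with k = 2, the strings "AACA" and "ACAA" contain the same 2-mers (AA, AC, CA) and therefore map to the same vector.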

Generalized k-mer vector and the properties of k-mer vector space
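A k-mer vector is defined as $\vec v = (v_1, v_2, \cdots, v_{4^k})$, where $v_i \in \mathbb{R}$. It stands for a generalized biological sequence $s$. The concept of a sequence is extended here: it is no longer limited to continuous strings. The sequence $s$ is represented by its k-mer set, which is described by the k-mer vector. From this definition of the k-mer vector, we can generate a k-mer vector space.
Let $V$ be the set of all possible k-mer vectors, which is nonempty, and $\mathbb{R}$ the set of real numbers. We use the addition and multiplication of real numbers to define the operations on $V$. Define the addition of $\vec v, \vec v' \in V$ as
$$\vec v + \vec v' = (v_1 + v'_1, v_2 + v'_2, \cdots, v_{4^k} + v'_{4^k}).$$
It can be seen as the merging of two sequence sets, which results in a new set. Define the scalar multiplication of $c \in \mathbb{R}$ and $\vec v \in V$ as
$$c\,\vec v = (c v_1, c v_2, \cdots, c v_{4^k}).$$
The multiplication is the scaling of the k-mer vector of a sequence set. We have
Theorem 1. $V$ is a linear space on $\mathbb{R}$.
Proof of Theorem 1. For the defined vector space $V$ and field $\mathbb{R}$ with the two operations, the following conditions are satisfied for all $\vec u, \vec v, \vec w \in V$ and $a, b \in \mathbb{R}$: (1) $\vec u + \vec v = \vec v + \vec u$; (2) $(\vec u + \vec v) + \vec w = \vec u + (\vec v + \vec w)$; (3) there is a zero vector $\vec 0 = (0, \cdots, 0)$ with $\vec v + \vec 0 = \vec v$; (4) for each $\vec v$ there is $-\vec v$ with $\vec v + (-\vec v) = \vec 0$; (5) $1\,\vec v = \vec v$; (6) $a(b\,\vec v) = (ab)\,\vec v$; (7) $a(\vec u + \vec v) = a\,\vec u + a\,\vec v$; (8) $(a + b)\,\vec v = a\,\vec v + b\,\vec v$. Moreover, $\vec u + \vec v \in V$ and $a\,\vec v \in V$, so $V$ is closed under the linear operations.
Using the linear operations over the k-mer vector space $V$, we can merge the sequencing data from different samples into one, get the difference between two sequence sets, or normalize multiple data sets. Now we introduce the definition of the inner product on the k-mer space. The set of all possible k-mer vectors, $V$, is a linear space over $\mathbb{R}$. For $\vec v, \vec v' \in V$, define
$$\langle \vec v, \vec v' \rangle = \sum_{i=1}^{4^k} v_i v'_i.$$
Geometrically, the inner product $\langle \vec v, \vec v' \rangle$ gives us the product of the length of one vector and the projection of the other vector onto its direction. It introduces the concept of the included angle in the k-mer vector space.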
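The linear operations and the inner product can be sketched on sparse vectors as follows. This is a minimal illustration, assuming k-mer vectors are stored as Python dictionaries that map a k-mer index to a real value; the helper names `add`, `scale` and `inner` are hypothetical and not taken from any tool described here.

```python
def add(u, v):
    """Element-wise sum of two sparse k-mer vectors (merging two sequence sets)."""
    out = dict(u)
    for idx, value in v.items():
        out[idx] = out.get(idx, 0) + value
    return {i: x for i, x in out.items() if x != 0}


def scale(c, v):
    """Scalar multiple of a sparse k-mer vector (e.g. depth normalization)."""
    return {i: c * x for i, x in v.items() if c * x != 0}


def inner(u, v):
    """Standard inner product; iterates over the shorter vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(i, 0) for i, x in u.items())


u = {1: 2, 6: 1}
v = {6: 3, 12: 1}
merged = add(u, v)                 # merging two data sets
difference = add(u, scale(-1, v))  # difference; negative entries are allowed
shared = inner(u, v)               # only index 6 is shared
```

The difference of two vectors can contain negative entries, which matches the extension of the element domain to the real numbers.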

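The length of a k-mer vector $\vec v$ is
$$\|\vec v\| = \sqrt{\langle \vec v, \vec v \rangle} = \sqrt{\sum_{i=1}^{4^k} v_i^2},$$
where $k$ is the k-mer length. For a sequence with a uniform distribution of depth, it reflects the sequence length. Since the total number of k-mers is $\sum_i v_i = d\,(L - k + 1)$ for a sequence of length $L$ with uniform depth $d$, the length $L$ can be recovered from the k-mer counts. This is the principle of genome size ($L$) estimation using the k-mer distribution. We also have the Cauchy-Schwarz inequality $|\langle \vec v, \vec v' \rangle| \le \|\vec v\|\,\|\vec v'\|$. Equality holds when $\vec v$ and $\vec v'$ are linearly dependent. It means that the two sequences are composed of the same k-mers in the same proportions; for example, they can be two identical sequences with different read depths. The distance between k-mer vectors is defined as
$$\operatorname{dist}(\vec v, \vec v') = \|\vec v - \vec v'\|.$$
The included angle $\theta$ of k-mer vectors $\vec v$ and $\vec v'$,
$$\cos\theta = \frac{\langle \vec v, \vec v' \rangle}{\|\vec v\|\,\|\vec v'\|},$$
gives us the cosine similarity between two k-mer vectors, which stand for two sequences. It shows how close the directions of two sequences are in the k-mer vector space. For a sequence set $S$ in (2), $SS^T$ gives us a symmetric matrix in which each element is the inner product of two vectors. We calculate it by
$$SS^T = \big(\langle \vec s_i, \vec s_j \rangle\big)_{n_S \times n_S}, \qquad (4)$$
$$\left(\frac{\langle \vec s_i, \vec s_j \rangle}{\|\vec s_i\|\,\|\vec s_j\|}\right)_{n_S \times n_S}, \qquad (5)$$
where $\vec s_i$ is the $i$-th row of $S$. Denote expression (4) as $T_S = SS^T$ and expression (5) as $T_{sim}$. $T_{sim}$ is the cosine similarity matrix of the sequence set $S$. In some practical problems, for example the de novo transcriptome assembly of RNA-Seq reads where the reads have identical read length, using $T_S$/length is equivalent to $T_{sim}$ since all the vectors have the same length.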
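The two matrices $T_S$ and $T_{sim}$ can be computed directly from sparse row vectors. The sketch below is illustrative only: it assumes each row of $S$ is a dictionary from k-mer index to value, that no row is the zero vector, and it uses hypothetical helper names (`gram_matrix`, `cosine_matrix`).

```python
import math


def inner(u, v):
    """Inner product of two sparse vectors stored as dictionaries."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(i, 0) for i, x in u.items())


def norm(v):
    """Length of a k-mer vector."""
    return math.sqrt(inner(v, v))


def gram_matrix(S):
    """T_S = S S^T for a list of sparse row vectors."""
    return [[inner(a, b) for b in S] for a in S]


def cosine_matrix(S):
    """T_sim: entry (i, j) of T_S divided by ||s_i|| * ||s_j||."""
    norms = [norm(v) for v in S]  # assumes every row is nonzero
    return [[inner(a, b) / (na * nb) for b, nb in zip(S, norms)]
            for a, na in zip(S, norms)]


S = [{0: 1, 1: 1}, {1: 1, 2: 1}, {5: 2}]
T_S = gram_matrix(S)
T_sim = cosine_matrix(S)
```

In this toy set the first two vectors share one k-mer and have a cosine similarity of about 0.5, while the third vector is orthogonal to both. For large read sets a dense $n_S \times n_S$ matrix is impractical, which is one motivation for a hashmap-based computation.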
The columns of $S$ show the distribution of each distinct k-mer in the sequence set. The row rank of $S$ is the maximum number of linearly independent sequences. Since the row rank equals the column rank, if the $n_S$ sequences in $S$ are linearly independent, $n_S$ suitably selected k-mers are sufficient to distinguish them. This gives theoretical support for using only part of the k-mers in RNA-Seq quantification, as in Sailfish [12].

§3 The matrix forms of sequence-based problems
In this section, we describe three sequence-based problems in their k-mer vector/matrix forms. This offers a new angle from which to look at these problems and helps us build new analysis algorithms.

Quantification of short reads to reference sequences
In some problems, such as RNA-Seq analysis, one of the main steps is aligning the short reads back to the reference sequences and obtaining the quantification result.
Let the matrix of reference sequences be $S_r$ and the matrix of short reads be $S$. The total number of aligned reads of each sequence in $S_r$ can then be obtained in the following form.
When a reference sequence is highly expressed in the RNA-Seq problem, the duplicated short reads result in a large norm of the corresponding k-mer vector. The sum of each column in $SS^T$ is correlated with the total number of short reads belonging to each reference sequence; this column sum is the quantity $c_t$ in (8). The widely used RPKM value is defined as RPKM = total exon reads / (mapped reads (millions) * exon length (kb)). It counts the reads over each sequence, while some other methods use the number of fragments over each sequence. Short subsequences of fixed length can also be used in quantification [9]. From $c_t$ in (8), we can get an approximation of the RPKM value of each reference sequence. Further adjustment can be applied when the necessary assumptions are made in the analysis.
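The idea of quantifying reads through inner products can be sketched as follows. This is an illustration under explicit assumptions rather than the exact formulation used here: it forms the read-by-reference matrix of inner products (written here as $S S_r^T$, which is an assumption), takes the column sums as a per-reference support value, and omits any RPKM-style normalization. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Sparse k-mer vector keyed by the k-mer string itself."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def inner(u, v):
    """Inner product of two sparse k-mer vectors."""
    return sum(c * v.get(kmer, 0) for kmer, c in u.items())


def quantify(reads, references, k):
    """Return the read-by-reference inner products and their column sums."""
    S = [kmer_counts(r, k) for r in reads]         # read matrix (sparse rows)
    S_r = [kmer_counts(r, k) for r in references]  # reference matrix
    hits = [[inner(read, ref) for ref in S_r] for read in S]
    column_sums = [sum(row[t] for row in hits) for t in range(len(S_r))]
    return hits, column_sums


references = ["AAACCCGGG", "TTTGGATCA"]
reads = ["AAACC", "ACCCG", "TTTGG"]
hits, c = quantify(reads, references, 3)
```

In this toy case the first two reads share three 3-mers each with the first reference and none with the second, and the third read behaves the other way round, so the column sums separate the two references. Reads that share k-mers with several references would contribute to several columns, which is where further adjustment becomes necessary.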

Sequence assembly problem
A sequence assembly problem is a clustering problem of the input sequence set based on similarities. Using the similarity matrix $T_{sim}$ (5), we can quickly get the similarity value between any two sequences in the set. Moreover, by permuting the rows and columns of the similarity matrix simultaneously, we can generate a block diagonal matrix. Each block is a cluster of sequences.
Here we discuss the de novo transcriptome assembly problem of RNA-Seq reads. To simplify the problem, assume all the vectors have the same length, so we use $T_S$ as the similarity matrix for discussion. For a set of sequenced reads, let the matrix of k-mer vectors be $S$. Then, after a suitable reordering of the reads, we have
$$T_S = SS^T = \begin{pmatrix} T_S^{(1,1)} & & \\ & \ddots & \\ & & T_S^{(m,m)} \end{pmatrix}.$$
For each $T_S^{(i,i)}$, $i \in \{1, 2, \cdots, m\}$, where $m$ is the number of diagonal blocks, the corresponding reads can generate an assembled unique sequence. The vectors clustered into the same block $T_S^{(i,i)}$ have a high probability of having the same origin. To get a more accurate clustering result, we can set a cutoff on the similarity values in the similarity matrix. For $T_S = SS^T = (t_{ij})$, $i, j \in \{1, 2, \cdots, n_S\}$, $l$ the read length and $p_{sim}$ the predefined similarity cutoff, set
$$t_{ij} \leftarrow \begin{cases} t_{ij}, & t_{ij} \ge p_{sim} * l, \\ 0, & \text{otherwise}. \end{cases}$$
In $T_S$, take the first row $\vec t^{\,1}_s$ as an example. $\vec t^{\,1}_s$ holds the similarity between $\vec s_1$ and $\vec s_i$, $i \in \{1, 2, \cdots, n_S\}$. Thus the entries of $\vec t^{\,1}_s$ whose value is greater than 0 (or the cutoff $p_{sim} * l$) show us the vectors in $S$ that are correlated with $\vec s_1$. These vectors form a set generated from $\vec s_1$. Then $T_S * \vec t^{\,1}_s$ shows the correlation between any vector in $S$ and the vector cluster generated from $\vec s_1$. $T_S * \vec t^{\,1}_s$ gives us a clustered group generated from $\vec s_1$ with high accuracy, since it considers not the correlation between two vectors but the correlation between a vector and a cluster of vectors.
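The clustering idea described above, thresholding the similarity matrix and then growing a group from one row by repeated multiplication with the matrix, can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes a symmetric matrix with non-negative entries, treats the row as a 0/1 membership vector, and repeats the product until the group stops growing. All function names are hypothetical.

```python
def threshold(T, cutoff):
    """Set every entry below the cutoff to 0."""
    return [[x if x >= cutoff else 0 for x in row] for row in T]


def grow_cluster(T, seed):
    """Grow the group generated from row `seed` by repeated products T * t."""
    n = len(T)
    members = [1 if T[seed][j] > 0 else 0 for j in range(n)]
    members[seed] = 1
    while True:
        # correlation of every vector with the current group
        scores = [sum(T[i][j] * members[j] for j in range(n)) for i in range(n)]
        grown = [1 if (s > 0 or m) else 0 for s, m in zip(scores, members)]
        if grown == members:
            return [i for i, m in enumerate(members) if m]
        members = grown


def clusters(T, cutoff):
    """Partition the row indices of T into groups (blocks)."""
    Tc = threshold(T, cutoff)
    assigned, groups = set(), []
    for seed in range(len(Tc)):
        if seed in assigned:
            continue
        group = grow_cluster(Tc, seed)
        assigned.update(group)
        groups.append(group)
    return groups


T_S = [[5, 3, 0, 0],
       [3, 5, 2, 0],
       [0, 2, 5, 0],
       [0, 0, 0, 5]]
loose = clusters(T_S, 2)   # weak link between rows 1 and 2 is kept
strict = clusters(T_S, 3)  # weak link is removed by the cutoff
```

With the lower cutoff, rows 0, 1 and 2 end up in one group even though rows 0 and 2 have no direct similarity, because row 2 is correlated with the group generated from row 0. Raising the cutoff removes the weak link and splits the group. The groups correspond to the diagonal blocks obtained after reordering.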

Pattern detection
The detection of a fixed pattern in a sequence set, such as the detection of motifs [18,19], is a common sequence analysis problem. It can help us find a conserved domain or predict the function of a sequence. Here we discuss two simplified situations in which the pattern length is smaller than the k-mer length $k$.
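Let $S$ be a set of candidate sequences. Let $s_p$ be a query pattern and $l_p$ the length of $s_p$, $l_p < k$. Then, for each fixed position of $s_p$ within a k-mer, we have a set of $4^{(k - l_p)}$ possible k-mers that contain $s_p$. Denote the set of all k-mers containing $s_p$ as $S_p$. Use all the k-mers in $S_p$ to build a sequence vector, $\vec v_p$.
So we have the product $S * \vec v_p$, which has one entry per sequence in $S$. Any positive value in $S * \vec v_p$ indicates a sequence in $S$ that possibly contains the pattern $s_p$.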
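A small sketch of this pattern query is given below. It assumes a short k so that all $4^k$ k-mers can be enumerated by brute force, builds $\vec v_p$ as a 0/1 indicator over the k-mers that contain the pattern at any position, and computes one score per candidate sequence. The function names are hypothetical.

```python
from collections import Counter
from itertools import product


def pattern_kmers(pattern, k):
    """All k-mers over ACGT that contain `pattern` as a substring (brute force)."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer for kmer in kmers if pattern in kmer}


def pattern_scores(sequences, pattern, k):
    """One entry of S * v_p per sequence: count of k-mers containing the pattern."""
    v_p = pattern_kmers(pattern, k)
    scores = []
    for seq in sequences:
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        scores.append(sum(c for kmer, c in counts.items() if kmer in v_p))
    return scores


scores = pattern_scores(["AAGGT", "ACACA", "TTGGG"], "GG", 3)
```

A positive score marks a candidate sequence, and a zero score rules the pattern out for sequences of length at least k. For realistic values of k the indicator set would be generated from the pattern directly instead of filtering all $4^k$ k-mers.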
Let $s_{pp}$ be the query pattern and $l_{pp}$ the length of $s_{pp}$, $l_{pp} < k$. Now suppose the first and second bases of $s_{pp}$ are not fixed. Let $P_1(x)$ ($P_2(x)$), $x \in \{A, G, C, T\}$, denote the probability of the first (second) base being $x$. Then we have $C_4^1 * C_4^1 = 16$ possible query patterns with occurrence probability $P_1(x) * P_2(y)$, $x, y \in \{A, G, C, T\}$. From each of the 16 patterns, we can get a k-mer vector $\vec v_{pp}^{(x,y)}$ as we did in the above situation, and combine them as
$$S * \sum_{x, y \in \{A, G, C, T\}} P_1(x) * P_2(y)\, \vec v_{pp}^{(x,y)}. \qquad (21)$$
Any positive value in expression (21) indicates a sequence in $S$ with a possible form of pattern $s_{pp}$. It is an example of solving a sequence-based problem with indeterminate bases using the k-mer vector method. For more complicated situations, the Monte Carlo method could be a choice.

§4 Algorithm implementation and discussion
Many methods are capable of solving the assembly problem of biological sequences [20,21,22]. Most of their algorithm design strategies are based on the common string representation of sequences. Here, we take the assembly problem in the matrix form discussed in section 3.2 as an example of algorithm design in the k-mer vector space. We explain how the theory of the k-mer vector and matrix is connected with real problems in sequence-based biological analysis.

Design and implementation of vector-based algorithm for assembly problem
We developed a program named iLoqu to implement the use of the similarity matrix in the k-mer space for sequence assembly. It is written in C++ and can be executed under either 32-bit or 64-bit Linux or Windows systems. To calculate the similarity matrix, we used the hashmap from C++. iLoqu contains three main steps: 1. extract the k-mers from the sequences to generate a hashmap; 2. use the hashmap to calculate the similarity matrix; 3. cluster the k-mer vectors into groups and assemble each group.
iLoqu takes a FASTA file as its input. The program starts by reading the sequences from the FASTA file and extracting all the k-mers to create a hashmap. The matrix of k-mer vectors generated from the sequence set is a sparse matrix. The hashmap data structure provides a feasible method to calculate the similarity matrix while reducing the storage requirement. The k-mers from each sequence are added to the hashmap within a for-loop. Each k-mer is a key in the hash table, and the corresponding information of the sequence is stored in its value. Thus in the hashmap, the key is the column index of the matrix while the value contains the row information.
The second step is generating the similarity matrix using the hashmap built in the first step. In the implementation, the similarity matrix is generated together with step one, in the same while loop that reads the sequences; see Algorithm 1. The similarity matrix is built from the upper-left corner to the lower-right corner, and it is a symmetric matrix. Recall that a symmetric matrix is a square matrix that is equal to its transpose: $A$ is symmetric if and only if $A = A^T$. Here we set a cutoff value to remove the low-similarity data: the inner product of two k-mer vectors $\vec v, \vec v'$ is set to 0 if it is smaller than the preset similarity cutoff. Also, two vectors with an extremely high inner product value can be merged into one in advance, before the following clustering step.
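The first two steps can be sketched in Python as follows. This is not the iLoqu source code: it is a minimal two-pass illustration (iLoqu is described as doing both steps in one loop), the function name `similarity_from_hashmap` is hypothetical, and the normalization by the square root of the product of the two sequence lengths is taken from the surviving line of Algorithm 1.

```python
from collections import defaultdict
from math import sqrt


def similarity_from_hashmap(seqs, k, cutoff=0.0):
    """Sparse upper-triangular similarity matrix {(i, j): value}, i <= j."""
    # Step 1: hashmap with k-mer as key (column) and {sequence id: count} as value (rows).
    index = defaultdict(lambda: defaultdict(int))
    for sid, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]][sid] += 1

    # Step 2: each k-mer contributes to the inner product of every pair sharing it.
    dot = defaultdict(int)
    for postings in index.values():
        ids = sorted(postings)
        for a in range(len(ids)):
            for b in range(a, len(ids)):
                dot[(ids[a], ids[b])] += postings[ids[a]] * postings[ids[b]]

    # Normalize and drop entries below the cutoff.
    sim = {}
    for (i, j), value in dot.items():
        value = value / sqrt(len(seqs[i]) * len(seqs[j]))
        if value >= cutoff:
            sim[(i, j)] = value
    return sim


seqs = ["AAACC", "ACCCG", "TTTGG"]
sim = similarity_from_hashmap(seqs, 3, cutoff=0.1)
```

Only pairs of sequences that share at least one k-mer ever receive an entry, so unrelated pairs cost neither time nor memory. Very frequent k-mers produce long posting lists and a quadratic number of pairs, which a practical implementation would need to bound.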
The last step is clustering the k-mer vectors based on the similarity matrix. After the inner product values are filtered with a cutoff, the row and column coordinates of any positive value in the similarity matrix give us two vectors that could be grouped into one. For the final assembly, since the number of sequences in each group is greatly reduced compared to the whole sequence set, many assembly methods can handle this situation. In iLoqu, we also present an assembling procedure using the k-mers. On the other hand, if the assembly result is going to be used in the k-mer vector form in the downstream analysis, such as the quantification step described in section 3.1, there is no need for an actual assembly. We can generate a set of reference vectors from the block diagonal matrix result.

Algorithm 1 Read the input sequences, build hashmap and calculate the similarity
while read-in sequence s do
    Array.Add(Hash_seq(kmer))
    ...
    similarity ← Hash_seq(id) / sqrt(seq1.length * seq2.length)
end while
The assembly algorithm is described in Algorithm 2. Starting from the vectors with large length, we select one seed k-mer from the vector and look up the next k-mers on its two sides in the hashmap.

Algorithm 2 Sequence assembly from the hashmap
kmer ← Getkmer(s, i, k)
if key is not used as seed yet then
    ...
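Since only a fragment of Algorithm 2 is available, the following Python sketch shows one simple way to realize seed-and-extend assembly from a k-mer hashmap. It is an assumption-laden illustration, not the iLoqu procedure: seeds are taken in order of decreasing k-mer count, each extension step picks the most frequent unused neighbouring k-mer, and reverse complements, sequencing errors and repeats are ignored. All function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Hashmap from k-mer to its number of occurrences in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(seed, counts, used, to_right):
    """Greedily extend a seed k-mer in one direction through unused k-mers."""
    contig, current = seed, seed
    while True:
        candidates = []
        for base in "ACGT":
            nxt = current[1:] + base if to_right else base + current[:-1]
            if counts.get(nxt, 0) > 0 and nxt not in used:
                candidates.append((counts[nxt], nxt, base))
        if not candidates:
            return contig
        _, nxt, base = max(candidates)  # most frequent neighbour wins
        used.add(nxt)
        contig = contig + base if to_right else base + contig
        current = nxt


def assemble(reads, k):
    """Assemble contigs by extending each unused seed k-mer on both sides."""
    counts = count_kmers(reads, k)
    used, contigs = set(), []
    for seed, _ in counts.most_common():
        if seed in used:  # key already used as a seed or in an extension
            continue
        used.add(seed)
        right = extend(seed, counts, used, True)
        left = extend(seed, counts, used, False)
        contigs.append(left + right[k:])  # both parts contain the seed once
    return contigs


contigs = assemble(["ACGTTG", "GTTGCA"], 3)
```

The two overlapping reads are merged into the single contig "ACGTTGCA". In practice the extension would stop or branch at ambiguous positions, and low-count k-mers would be filtered first.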

Numerical example
iLoqu is a tool designed for sequence assembly based on the concept of the k-mer vector. It is suitable for the overlap-based assembly problems of RNA-Seq. As mentioned above, the assembly process from the grouped vectors to string form is not necessary if the result is going to be used in downstream sequence-based analysis in vector/matrix form. So we do not compare the assembly speed with other tools but show the assembly accuracy. We tested iLoqu on simulated data and real biological data. For the real RNA-Seq data, we used the reads that can be aligned to ten selected genes by Bowtie [23]. These ten genes are randomly selected from the cDNA file of Arabidopsis thaliana (Araport11_genes.201606.cdna.fasta) or Homo sapiens (GCF_000001405.38_GRCh38.p12_rna.fna) among the genes with a high expression level (at least 1000 reads covered) in the RNA-Seq sequencing dataset (SRR4024923.sra of Arabidopsis thaliana and ERR2870199.sra of Homo sapiens). These files can be downloaded from NCBI. The same ten genes are also used as the reference for generating the simulated data. We randomly generated a set of sequences, normally distributed along the reference genes, with a length greater than 100 bp, and we introduced a 1% random SNP (single nucleotide polymorphism) variation into the simulated sequence dataset. The parameters used in iLoqu are a minimum similarity percentage of 1%, a k-mer size of 19, a minimum tag count of 30, and a minimum assembly similarity of 90%. Table 1 shows the assembly results for both test data sets. The assembly results were aligned back to the reference using [24] to show the accuracy.
From Table 1, we can see that iLoqu successfully assembled the test data. In the test on Arabidopsis thaliana data, 98.3% and 68.3% of the reference is covered by the assembly result from the simulated data and the real data, respectively. In the test on Homo sapiens data, the percentages are 96.6% and 77.1%. The depth and the variations affect the assembly quality. This result shows that the k-mer vector method is a feasible solution for the assembly problem and is worth further study.

Discussion
The generalized k-mer vector defined in this work is a method for describing a biological sequence or sequence set through its k-mer set. It is defined using the most basic common features extracted from different sequence analysis problems. Thus, it is a more general description method with better applicability and scalability than existing matrix methods for sequence analysis problems. It adapts more easily to new problems and is better suited to analysis when multiple sequence-based problems are combined as a whole. Many sequence-based problems in bioinformatics can be converted to their corresponding vector/matrix form using the k-mer vector under this definition. This form can reveal part of the nature of sequence analysis problems that cannot be seen directly in other basic forms, and it offers us a chance to build a new analysis system.
There is a complete body of research findings on linear spaces, matrices, and matrix operations in linear algebra and computational mathematics. It gives us a whole new set of methods to deal with sequence-based problems in bioinformatics. The challenge is to use the language of biology to explain the mathematical concepts. In this work, we extend the value domain of the elements of the k-mer vector from integers to real numbers. Furthermore, the vector operations are interpreted as operations on the k-mer sets of generalized sequences. This makes the matrix methods of biological sequences a practical tool for problem analysis and algorithm design.
The computation of sequence similarity is one of the common features of many sequence-based biological problems, such as sequence alignment to a reference, sequence assembly, and sequence evolution analysis. These features can be described uniformly using the k-mer vector form. It links different sequence-based problems, or their upstream and downstream analyses, as a whole, reducing the computational cost of the associated algorithms. The successful examples of algorithm design using k-mers (for example kallisto, which avoids sequence alignment by using k-mers [11]) indicate that the k-mer is an effective tool for solving high-complexity bioinformatics problems. While implementing the algorithm design in k-mer vector/matrix form for the sequence assembly problem, we found that the hashmap structure, the k-mer, and the sparse matrix match very well in the real problem. This is further evidence for the feasibility of applying sequence matrix methods in bioinformatics.
The k-mer vector uses features extracted from a sequence to represent it. It gives better control of the information redundancy, which could be a burden for the design of computational analysis algorithms. On the other hand, it may cause a loss of information. More theoretical analysis and quantitative tests are needed to find the balance between the different forms. The mapping between the vectors in the space and the sequences is not a bijection. In practice, the effect of this shortcoming is limited and can be ignored; this has been shown and used as a reasonable assumption by other researchers [6]. A remaining challenge is the visualization of a sequence from its matrix form. The string representation of biological sequences has a natural advantage here, while the vector form makes the sequence structure visually abstract.
In future work, we will present a solution for sequence visualization from the vector form. We will then match more concepts from matrix analysis with sequence analysis and, finally, try to build equations from sequence analysis problems and solve them in matrix form. Moreover, highly efficient data structures and improved algorithms are needed for solving real problems.

§5 Conclusions
We gave a generalized definition of the k-mer vector and its linear space for biological sequences. It is a method of representing the k-mer set in vector form to describe a sequence-based problem. The basic concepts and operations in the vector space are related to their corresponding biological meanings. We used three examples to explain the theory, covering the use of basic operations, the relationship between upstream and downstream analysis steps, and cases with probabilities. This is a way to discover new features of sequence analysis problems in bioinformatics and to generate new algorithm design strategies. We also proposed a solution to the sequence assembly problem as an application example of algorithm design using the vector form of biological sequences, which shows the feasibility of the vector (matrix) method in real sequence analysis.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Here $d_i$ is the depth on position $i$ along sequence $s$ and $d_i \in \mathbb{N}^+$. Moreover, different sequencing technologies have different sequencing accuracy. The actual depth $d'_i$ on position $i$ should be calculated as $d'_i = d_i * p$, where $*$ is the multiplication of real numbers and $p$ denotes the sequencing accuracy. This shows that the depth value can be adjusted from a positive integer to a positive real number.

Table 1. The test result of iLoqu with the simulated data and real data.