A fast and efficient algorithm for DNA sequence similarity identification

DNA sequence similarity analysis is necessary for numerous purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can solve the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, low precision, and poor performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the CGR approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for the k-mer, and we develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets.
We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16S Ribosomal, 18 Eutherian), and lower time and memory consumption than existing studies on the HEV and HIV-1 datasets. The comparative results on the benchmark datasets and against existing studies demonstrate that our method is effective, efficient, and accurate. Thus, our method can be used reliably for DNA sequence similarity measurement.


Introduction
Sequence analysis is an active research area in bioinformatics, bioinformatics engineering, and computational biology. It is essential for analyzing the evolutionary relationships among living organisms from whole genomes, finding gene regulatory regions, identifying virus-host interactions, detecting horizontal gene transfer, analyzing the similarity of sequences, and extracting different kinds of biological information [28]. Extracting biological information from whole genomes is becoming increasingly important because biological data has expanded rapidly over the last few decades (the approximate growth rate is a doubling every 18 months) [34]. Broadly, there are two types of sequence analysis: AB and AF. AB algorithms have several limitations: they provide good results only for homologous sequences, they work only for comparatively small sequences, and they are time and space consuming. For multiple and long sequences, alignment becomes an NP-hard problem. AF algorithms can solve these major limitations of AB algorithms [40,41]. Due to their time and memory efficiency, AF methods are widely used in free, paid, open, and publicly available software, including MEGA (Molecular Evolutionary Genetics Analysis) [13], MEGA7/X [13], CAFE (aCcelerated Alignment-FrEe sequence analysis) [22], and Co-Phylog [35].
Various AF-based studies have been conducted on sequence similarity analysis. Among them, pattern histograms [23], suffix tree counts [18], k-mer encoding-based image analysis [6,7,8], the chaos game representation (CGR) approach [2,30,38], and convolutional neural networks (CNN) applied to CGR images [30] are used extensively. We discuss different AF models, their strengths, and their limitations in the second section. From that analysis, we find that most AF-based approaches share some general limitations, e.g., high time complexity, high memory consumption, low precision, no optimal selection of k for the k-mer, high performance reported only on small datasets, and no comparison against benchmark datasets.
Therefore, in this research, we aim to develop an AF sequence similarity measurement model that overcomes the limitations of existing models. For any dataset, our model dynamically selects k for the k-mer by considering the whole dataset. It then generates a 2D count matrix of k-mers in a fast and efficient way by accurately calculating the position of each k-mer string in the 2D matrix. After that, it shrinks the 2D matrix by analyzing neighbors and generates a 1D feature descriptor. Finally, we experiment to find the best combinations of distance and phylogenetic tree generation methods to achieve high precision. Thus, the method effectively calculates the similarities of any sequence dataset.
The rest of the manuscript is organized as follows. In the next section, we discuss existing AF models with their strengths and limitations. In the subsequent section, we present our novel sequence similarity measurement method. Then we discuss the datasets, the performance on each dataset, and the performance of the overall system in comparison to existing studies. Finally, we summarize the contributions and limitations of our system and outline some future directions.
The details of the dataset and the implemented code are publicly available (https://drive.google.com/drive/folders/1NIJUqtHryV7nhzPRbKyJT8U6ZTYpre2U?usp=sharing).

Background study
Ren et al. [28] performed a comparative study to analyze the pros and cons of different AF algorithms in the field of sequence analysis. The study also mentioned that AB approaches provide higher accuracy than AF methods, whereas AF methods are computationally efficient and consume less memory. Yang et al. [34] described different approaches for encoding a DNA sequence as numbers, e.g., sequential, one-hot [33], and k-mer encoding. They pointed out that several choices, e.g., the encoding, the feature extraction technique, and the distance measure, may affect overall performance.
Jin et al. [14] analyzed different methods used for DNA sequence similarity identification and stated that a good similarity algorithm should satisfy the following: (i) it should have a strong encoding technique to reduce information loss, (ii) the extracted features should work for small, large, and mixed-length sequences (lengths varying from 10^2 to 10^10 or more), and (iii) it should have high precision, low time complexity, and low space consumption. Luczak et al. [23] surveyed and evaluated different histogram-based distance measures used for phylogeny analysis. They mentioned that, to achieve better accuracy, the size of the k-mer should be increased for comparatively larger sequences. Zielezinski et al. [40] developed a benchmark for comparing thousands of AF algorithms developed for different sequence analysis studies. They launched a web portal named AFproject (footnote 1) to which anyone can submit a self-developed AF algorithm to evaluate its performance against reference algorithms and datasets. Klötzl et al. [18] developed a suffix tree based algorithm and claimed that their method is faster and more accurate than Mash [26] and other pairwise algorithms. However, on the AFproject web portal, they obtained an RF distance of 6.00 for the fish dataset.
Chen et al. [7] developed a method for phylogeny analysis in which they converted a DNA sequence to a digital vector by assigning 1-mer values (A = 1, C = 2, G = 3, and T = 4) and combined it with index information. A gray level co-occurrence matrix (GLCM) was then calculated from the vector. Chen et al. [6] later extended this work using 2-mers and obtained comparatively good results with respect to previous studies. However, in both studies, the dataset was very small in comparison to the benchmark datasets. Similarly, Somodevilla et al. [32] used 1-mer encoding (A = 1, C = 0.5, G = 0.75, and T = 1) to generate an image and then used a CNN for DNA sequence classification. However, they faced a time complexity issue. Delibaş et al. [8] proposed a method that utilizes first-order statistical concepts from image texture. They used four small datasets and compared their dendrograms with MEGA7 and ClustalW. However, they did not apply their method to a large benchmark dataset to determine its accuracy, error, and rank. Delibaş et al. [9] later proposed a top-k n-gram based solution, in which the top-k n-grams are calculated from the counts of k-mers. They applied their system to different datasets, including the AFproject [40] fish benchmark dataset, where they achieved rank 6 with 68% accuracy.
Chaos game representation (CGR) is a square matrix of k-mer counts in a genome sequence. Traditionally, it is calculated based on the coordinate values of ancestor and predecessor DNA bases [2,30]. Zheng et al. [38] used the CGR image technique for finding circRNA-disease associations. Dick et al. [10] mentioned that coordinate point-based CGR calculations have several limitations. For that reason, they proposed four different CGR representations (FCGR, 20-node-amino-acid CGR, 20-node-amino-acid FCGR, and 20-Flake-FCGR) for protein classification. However, Changchuan Yin [36] showed that coordinate-based CGR matrix calculation suffers heavily from floating-point error and that an integer representation may provide a good result. Safoury et al. [30] worked on DNA sequence classification using a convolutional neural network (CNN) on CGR images. They prepared a square matrix of k-mer sequences, where each cell contains the count of a specific sequence, and their accuracy ranged from 51% to 100%. In addition, the method takes a long time to generate split images to train the CNN. Rizzo et al. [29] developed a DNA sequence classification method based on CGR images. Löchel et al. [20] also used deep learning (DL) techniques for protein classification and used CGR images to train the DL model. All of these studies mentioned that DL-based methods have high time complexity. Kania et al. [17] analyzed the behavior of CGR implementations and sequence correlations and found a strong relationship between the k-mer length and accuracy. Besides, k-mer frequency counting CGR (FCGR) methods were more sensitive in representing mutations, but they increased the time and space complexity.
Ni et al. [25] developed a method for DNA sequence similarity in which they used the FCGR technique. Generally, an 8-mer generates a 2D matrix of dimension (4^4 × 4^4) that contains 4^8 = 65,536 pixels. They reduced this matrix with bicubic interpolation, which returns a 2D matrix of dimension (16 × 16) that is then converted to a 1D (1 × 256) vector. Thus, it reduces the vector size as well as the time complexity. Later, perceptual image hashing difference was used for sequence similarity calculation. This method was tested on 21 HIV-1, 48 HEV, 8 mammalian chromosome DNA, and 25 Fish DNA sequences from AFproject [40]. For the AFproject [40] dataset, they obtained rank 2, RF distance 4, and score 91. Hence, the method reduces the time complexity and achieves good performance. However, for the same dataset, other methods exist in AFproject [40] that achieve rank 1, RF distance 2, and accuracy 95. So, there is scope to improve the accuracy or to optimize both the time complexity and the accuracy. AFproject [40] web portal so that the researchers and users can publicly see the performance. It will help to compare a new model.
Hence, we aim to develop a DNA sequence similarity technique that will address all of the above limitations.

Proposed methodology
In this section, we describe the detailed procedure of our proposed sequence similarity identification from raw sequences. The overall procedure of our proposed method is presented in Fig. 1.

Dynamic k for k-mer selection
In DNA sequence analysis research, AF algorithms work better than AB algorithms [28]. Generally, genome sequences (e.g., "AATTTTTAACG") are long strings consisting of different DNA bases ("A","T","C","G"). An AB algorithm considers the whole string and aligns it base by base; hence, these algorithms show high time complexity when analyzing multiple sequences. In contrast, an AF algorithm considers smaller DNA subsequences, known as k-mers, and then applies different count, histogram, network, or probability algorithms. A k-mer is a substring of a DNA sequence of a specific length k [23]. AF algorithms thus represent large sequences in a different form, e.g., as numbers rather than as a string. Different researchers used various k-mer lengths in their AF models. Safoury et al. [30] used two different values of k, which is very time consuming. The performance of a model depends heavily on choosing the right value of k [14,23]. Because k plays such a crucial role, choosing an appropriate value is challenging, and a method that chooses k dynamically is needed [23]. Therefore, we propose Algorithm 1 for finding the appropriate value of k for the k-mer. In this algorithm, we first read the N sequences of the dataset, then build a vector V that holds the individual lengths of the N sequences, and then calculate the average length L from V. Based on L, the algorithm selects the value of k.
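The selection rule can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation: the function name `select_k` is hypothetical, and the k = 8 branch for average lengths below 100,000 is an assumption taken from the results discussion (which reports that k = 8 works best when the average length is below 10^5); the k = 9 and k = 10 thresholds follow the surviving steps of Algorithm 1.

```python
def select_k(lengths):
    """Pick k for the k-mer from the average sequence length L.

    `lengths` plays the role of the vector V of per-sequence lengths.
    Thresholds for k = 9 and k = 10 follow Algorithm 1; the k = 8 branch
    is an assumption based on the reported experiments.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    avg_len = sum(lengths) / len(lengths)  # average length L
    if avg_len < 100_000:
        return 8
    if avg_len < 10_000_000:
        return 9
    return 10


if __name__ == "__main__":
    # Lengths in the range of the mitochondrial genomes (16 to 17 thousand bases)
    print(select_k([16_000, 17_000]))
    # Lengths in the range of the Yersinia genomes (4.5 to 4.7 million bases)
    print(select_k([4_500_000, 4_700_000]))
```

Working on a list of lengths rather than on the sequences themselves keeps the sketch cheap to run for genome-scale inputs.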
Algorithm 1
...
7: k ← 9, when 100000 ≤ L ≤ 9999999
8: k ← 10, when L ≥ 10000000
9: end if
10: return k

DNA sequence to 2D k-mer count matrix
After choosing an appropriate value of k using Algorithm 1, our process generates a 2D matrix using Algorithm 2, which is graphically presented in Fig. 1. Different researchers used a coordinate-based CGR approach to generate a 2D matrix from a sequence, using a coordinate averaging technique to move from the previous point to the next point [2,10,30,38]. However, due to the averaging technique, these methods suffer from floating-point error, which prevents them from achieving high precision. Moreover, different studies used the frequency chaos game representation (FCGR) for sequence-to-image conversion. Adetiba et al. [1] developed an FCGR from the derivatives of CGR images and found improved accuracy with an increasing number of derivatives, but the derivative process is computationally inefficient. Löchel et al. [20] developed an FCGR matrix based on a contraction ratio, which suffers from floating-point error. Joseph et al. [16] also used the FCGR technique by dividing the CGR image into 4 blocks, where each block was generated by averaging the coordinates of base points. This method also suffers from floating-point error. Rizzo et al. [29] used deep learning-based FCGR image analysis in which they calculated the frequency of a k-mer by iterating over the whole sequence, so their method consumes a large amount of execution time. Messaoudi et al. [24] implemented a technique for FCGR calculation that moves a template window (k-mer) along the whole sequence and counts the number of full matches. They generate FCGRs of different orders to enhance the performance. This method also consumes a large amount of CPU time, almost O(n^2). Different studies [11,19,25,39] developed FCGR matrices based on the coordinate averaging technique for detecting biological sequences.
Therefore, it is necessary to develop an accurate and time-efficient CGR count matrix for sequence analysis.
Hence, we aim to develop a method that generates a 2D k-mer count matrix whose cells are distributed according to the CGR formation. CGR is a method that iteratively represents the bases ("A","T","C","G") of a DNA sequence using the coordinates of a square matrix M or gray-level image of size (2^k × 2^k), where k is the length of the k-mer string. This process assigns one cell to each k-mer string using Algorithm 3, and the value of each cell indicates the frequency of that k-mer string, computed using Algorithm 2. It is possible to reconstruct the source sequence from the coordinates by backtracking. This CGR square matrix or gray-level image is suitable for finding the similarities among DNA sequences [2,25]. Our method calculates the position of a k-mer string in the 2D matrix using Algorithm 3 without any averaging technique. Hence, 2D count matrix generation by our method is effective in terms of both accuracy and time. In Fig. 2, we present the 2D matrix expansion process. Using Algorithm 3 and Algorithm 2 together, we generate the 2D count matrix.
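As an illustration of an integer-only mapping of this kind, the sketch below places each k-mer in a (2^k × 2^k) grid and counts occurrences with a single pass over the sequence. It is a hedged sketch, not the authors' Algorithms 2 and 3: the first-base quadrant assignment (A, C, G, T to the four cells of a 2 × 2 grid) follows the surviving steps of Algorithm 3, shifted to 0-based indices, while the refinement rule for later bases (doubling the coordinates and adding the quadrant offset) and the skipping of k-mers with non-ACGT symbols are assumptions. The names `QUADRANT`, `kmer_position`, and `count_matrix` are hypothetical.

```python
# Quadrant offsets for one base (0-based version of A=(1,1), C=(1,2), G=(2,1), T=(2,2)).
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def kmer_position(kmer):
    """Return the (row, col) cell of a k-mer using integer arithmetic only.

    Assumed refinement rule: each further base subdivides the current cell
    into four quadrants, so coordinates are doubled and the offset is added.
    """
    x = y = 0
    for base in kmer:
        dx, dy = QUADRANT[base]
        x = 2 * x + dx
        y = 2 * y + dy
    return x, y


def count_matrix(sequence, k):
    """Build the (2^k x 2^k) k-mer count matrix as a list of lists."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: k-mers containing ambiguous symbols (e.g. N) are skipped.
        if all(base in QUADRANT for base in kmer):
            x, y = kmer_position(kmer)
            matrix[x][y] += 1
    return matrix


if __name__ == "__main__":
    m = count_matrix("AATTTTTAACG", 2)
    for row in m:
        print(row)
```

Because every coordinate is an exact integer, no averaging step is involved, and the cost of the counting pass grows linearly with the sequence length for a fixed k.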

Matrix shrinking and feature descriptor
A k-mer count matrix contains most of the detailed information of a sequence. As k increases, the size of the count matrix also increases significantly. If we use the whole matrix as a feature vector, it will be computationally ineffective. Yet it is necessary to develop a method that works for long, medium, and short sequences [14]. So, different researchers proposed different methods (e.g., linear interpolation, bicubic interpolation [25]) for matrix or data shrinking. In the 2D k-mer matrix, each cell is important because it contains the number of occurrences of a specific k-mer in the whole sequence. Ni et al. [25] used bicubic interpolation, which shrinks the vector enormously, but the performance of their method on the benchmark dataset is not very promising. Hence, we propose a method to shrink the k-mer count matrix that calculates the square of the mean value of neighboring elements. The detailed procedure is presented in Algorithm 4.

Algorithm 3 k-mer string mapping in 2D count matrix
1: Input: kmer string
2: Output: x and y positions
...
6: if i = 1 then
7: x ← 1, y ← 1 when c = A
8: x ← 1, y ← 2 when c = C
9: x ← 2, y ← 1 when c = G
10: x ← 2, y ← 2 when c = T
11: end if
...
17: end while
18: return x and y

Hence, according to Algorithm 4, for a known k, if we shrink the (d × d) count matrix by the shrink rate S_r, the output matrix M_s will be (d' × d'), where d' = d/sqrt(S_r). We then convert the 2D M_s to the 1D D_s by row-column shifting. Hence, D_s is the feature descriptor.
Algorithm 4 2D k-mer count matrix shrinking algorithm using the square of the mean value of neighboring elements.

Cosine distance and phylogenetic tree construction
Statistical distance calculation methods depend heavily on data and pattern distributions [23]. Cosine similarity provides good results for k-mer probabilities or count matrices [5,23,27]. Let the length of the descriptor D be the dimension of the vector. Cosine similarity measures the angle between two vectors: a smaller angle indicates higher similarity, and an angle of zero means that the two vectors are parallel. Let the two descriptors for two sequences be x = D_s1 and y = D_s2. Their cosine is the inner product of the two vectors divided by the product of their magnitudes, as defined in Eq. 1.
$$\cos(\theta) = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} y_k^2}} \qquad (1)$$
where n is the length of descriptors x and y (both descriptors have the same length). The numerator is the dot product of the vectors, and the two factors in the denominator are the magnitudes of x and y, respectively.
Here, x_k is the kth element of descriptor x and y_k is the kth element of descriptor y. The cosine distance is one minus the cosine value, as given in Eq. 2.
$$L = 1 - \cos(\theta) \qquad (2)$$
where cos(θ) is given by Eq. 1. Hence, the value L is the cosine distance between two sequences. We can apply this technique to more than two sequences by comparing them pair by pair. If the number of sequences is n, then the length of L will be z = n(n − 1)/2, and thus its dimension will be 1 × z.
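The cosine distance and the one-to-one comparison over a whole dataset can be sketched in plain Python as follows. This is an illustrative sketch only: the function names `cosine_distance` and `pairwise_distances` are hypothetical, and raising an error for an all-zero descriptor is a choice made here, not something stated in the text.

```python
import math
from itertools import combinations


def cosine_distance(x, y):
    """Cosine distance: 1 - (x . y) / (|x| |y|)."""
    if len(x) != len(y):
        raise ValueError("descriptors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: an all-zero descriptor has no defined direction.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


def pairwise_distances(descriptors):
    """Distances for all n(n-1)/2 pairs, in (0,1), (0,2), ..., (n-2,n-1) order."""
    return [cosine_distance(a, b) for a, b in combinations(descriptors, 2)]


if __name__ == "__main__":
    descriptors = [[1, 0, 2], [2, 0, 4], [0, 3, 0], [1, 1, 1]]
    distances = pairwise_distances(descriptors)
    print(len(distances))  # 4 sequences give 4 * 3 / 2 = 6 distances
    print(distances)
```

The resulting flat list corresponds to the 1 × z distance vector, which is the usual input form for tree-building routines.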
A phylogenetic tree is a prime tool for visualizing genetic relationships [3]. We use the seqneighjoin function, which takes a new value from the distance matrix L and treats it as a new node q. It then computes the distance of q to all existing nodes. Hence, in each iteration, a new q is considered and the overall similarity values are updated. For any node q, the distance matrix calculation is presented in Eq. 3.
where L is the distance matrix, q is a new node, i and j are the iteration variables, p is the set of all existing nodes, and x is 1/2 for the equivar method [31] and between 0 and 1 for the firstorder method [12].

Results and discussion
In this section, we discuss the dataset collection, the performance achieved on benchmark and existing datasets, the effectiveness of our model, and some comparisons with existing works. We use a total of 6 standard genome datasets collected from different benchmarks and existing studies. We use the first 2 datasets for benchmark testing, the next 2 for comparing accuracy, and the remaining 2 for time and memory analysis. The 6 standard genome datasets are (i) complete mitochondrial DNA sequences of 25 cichlid fish (Table 1), (ii) genome sequences of 8 Yersinia species (Table 2), (iii) 16S ribosomal DNA of 13 bacteria [8,9], (iv) 18 Eutherian mammals [8,9], (v) HIV-1 [25], and (vi) HEV [25]. The first 2 are open challenge datasets from AFproject [40], which evaluates the performance and ranking of different AF algorithms used for sequence similarity identification. The next 2 (16S ribosomal, 18 Eutherian mammals) are collected from existing works [8,9]. The last 2 datasets (HIV-1, HEV), used for time and memory analysis, are taken from Ni et al. [25] and can be found at the URL given in footnote 2.

Software and server configuration
We run our method on a 2.80 GHz Intel(R) Core i5 computer with 8 GB of DDR3 RAM. As a development tool, we use MATLAB 2021a.

k for k-mer and shrink rate (S_r) selection
In AF algorithms, selecting the right value of k plays a vital role in the overall performance of a model [14,23]. However, increasing k also increases the time complexity exponentially. Moreover, it is not optimal to use the full k-mer count matrix as a feature vector because it increases the distance calculation time. So, we develop Algorithm 4 to shrink the vector size. To build an optimal model, we therefore need to choose the best combination of k and S_r with respect to different datasets. Here, S_r = 1 means no shrinking. For each k and S_r, we experiment with different combinations of pairwise distance (PD) and phylogenetic tree generation methods. This yields 88 combinations (details are available in Table 4) and therefore 88 RF distances. Among them, we consider the minimum RF value, which is listed in Table 3. The best result is the minimum RF value achieved with the smallest k and the largest S_r. From Table 3, we see that for the Fish dataset (Table 1), the best result, RF = 2, is achieved for k = 8 and S_r = 4. For the Yersinia dataset (Table 2), the best result, RF = 0, is found for k = 9 and S_r = 4. For the 16S Ribosomal dataset, the best result is RF = 0 for k = 8 and S_r = 4, and for the 18 Eutherian Mammal dataset, the best result is RF = 0 for k = 8 and S_r = 16. Generally, as S_r increases, the performance degrades for all k. Moreover, we find that the minimum k value, 8, provides the best result for all datasets except Yersinia, for which k = 9 provides the best result. We investigate the reason and find that the average length of the sequences in the dataset plays a crucial role in selecting k. When the average length is less than 10^5, k = 8 provides the best result. For Yersinia, the best k is 9 because its average length is about 5 × 10^6.
In the case of S_r, all datasets except 18 Eutherian mammals provide their best result at S_r = 4, whereas the 18 Eutherian dataset provides its best result at S_r = 16. Therefore, we set S_r = 4 for all four datasets. Based on the RF distances in Table 3, we develop Algorithm 1 to dynamically select the k value. These results indicate that our model is suitable for DNA sequence similarity datasets of different sizes, that Algorithm 1 is effective for selecting the length of the k-mer, and that Algorithm 4 shrinks the matrix efficiently.

PD and sequence joining method selection
To calculate DNA sequence similarity, we need to measure distances between feature vectors. This involves two steps: first finding the PDs from the feature vectors and then generating a phylogenetic tree from the distances. MATLAB provides different PD and phylogenetic tree generation methods, and performance may vary with the chosen combination of PD and tree generation method. Hence, choosing an appropriate combination of both is a challenge. Here, we use the best combinations of k and S_r selected in Experiment 4.2. To find out which combination is best for our model, we first apply each tree generation method with each distance method to the Fish dataset with k = 8 and S_r = 4 and calculate the RF distances presented in Table 4. The tree generation methods are of two types, seqlinkage and seqneighjoin. In total, we obtain 88 RF distances for the Fish dataset for 88 different combinations. From Table 4, we see that the minimum RF distance of 2, marked by the * sign, is achieved by cosine distance and seqneighjoin with the firstorder or equivar method. We also observe that, for all PD methods, the seqneighjoin technique provides better results than seqlinkage. For this dataset, our method achieves the best result (RF distance) in 5 combinations, and in all of these cases the seqneighjoin phylogenetic tree method is used. Therefore, the combination of cosine and seqneighjoin is the best pair for sequence similarity on the Fish dataset.
Similarly, we evaluate our method on the Yersinia dataset (Table 2) with k = 9 and S_r = 4, as selected in Experiment 4.2. The result is presented in Table 5. This time, we obtain the best result, an RF distance of 0, for 11 different combinations. Five PD methods (cosine, squaredeuclidean, seuclidean, correlation, and spearman) provide good results when combined with seqneighjoin. For both the Fish and Yersinia datasets, the combination of cosine and seqneighjoin provides the top score. Hence, after experimenting on two datasets with 176 combinations, we select cosine and seqneighjoin as the best combination for sequence similarity analysis.

Performance evaluation on fish benchmark dataset
To evaluate the strength of our proposed algorithm, we apply our method to the Fish dataset (Table 1) from AFproject [40]. About one hundred algorithms were submitted for benchmark ranking on the fish dataset. It contains 25 cichlid genome sequences whose lengths vary from 16 to 17 thousand bases. These sequences are very similar, so it is challenging to identify the accurate similarity or hierarchy for this dataset. AFproject [40] considers three parameters for evaluating algorithms: (i) the Robinson-Foulds (RF) distance [4,21], which measures the distance between phylogenetic trees, (ii) the normalized Robinson-Foulds (nRF) distance, which measures the topological mismatch of a given tree with respect to a reference tree, and (iii) the normalized quartet distance (nQD). We can convert the nRF value to accuracy using Eq. 4.
$$\text{Accuracy} = (1 - nRF) \times 100\% \qquad (4)$$
where nRF is the normalized Robinson-Foulds value.
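The conversion can be checked with a short worked example. The sketch below assumes that the accuracy is (1 − nRF) × 100 and that nRF is the RF distance divided by its maximum value 2(n − 3) for two unrooted binary trees on n taxa; this normalization is a standard convention and an assumption here, since the text does not spell it out. The function names are hypothetical.

```python
def normalized_rf(rf, n_taxa):
    """Normalize an RF distance by its maximum, 2 * (n - 3), for unrooted binary trees."""
    if n_taxa < 4:
        raise ValueError("at least 4 taxa are needed for a non-trivial RF distance")
    return rf / (2 * (n_taxa - 3))


def accuracy_percent(nrf):
    """Convert a normalized RF value to an accuracy percentage."""
    return (1 - nrf) * 100


if __name__ == "__main__":
    # Fish dataset: 25 sequences, RF distances of 2 and 4
    print(round(accuracy_percent(normalized_rf(2, 25))))
    print(round(accuracy_percent(normalized_rf(4, 25))))
    # Yersinia dataset: 8 sequences, RF distance of 0
    print(round(accuracy_percent(normalized_rf(0, 8))))
```

Under these assumptions, RF distances of 2 and 4 on 25 taxa round to 95% and 91%, which matches the accuracies quoted for the fish dataset.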
To compare the performance of the methods, we consider three parameters from AFproject [40] (see footnote 3). In Table 6, we list the top 5 methods, where our model holds the top rank with RF distance 2.0 and accuracy 95%. In Fig. 3, we present the phylogenetic tree generated by our method. We use k = 8 for matrix generation and S_r = 4 for shrinking the matrix. The comparative results and the phylogenetic tree indicate that our method provides the best result for sequence similarity identification. Ni et al. [25] applied k = 8 for the k-mer CGR matrix with a dimensionality reduction technique on the same dataset and achieved rank 2 with RF distance 4.0 and accuracy 91%. Among the 25 sequences, 4 are highly similar to one another, which is why none of the AF algorithms achieves 100% accuracy on this dataset. This demonstrates that our method is one of the top-performing methods.
Table 6 note: the top 5 methods among 80 methods are presented. Bold and the (*) sign mark the performance of our method.

Performance evaluation on Yersinia benchmark dataset
We also apply our method to the Yersinia benchmark dataset (Table 2) from AFproject [40]. It consists of 8 sequences of Yersinia species whose lengths vary from 4.5 to 4.7 million bases, so this dataset is large in practice. Approximately 80 algorithms have been submitted for benchmark ranking on the Yersinia dataset. We conduct the experiment in the same way as Experiment 4.4, using the AFproject [40] benchmark (see footnote 4). In Table 7, we list the top 5 methods, where our method achieves the top rank with RF distance 0.00 and accuracy 100%. We also present the similarity identification result as a phylogenetic tree in Fig. 4. Based on Experiment 4.2, we use k = 9 and S_r = 4 for 2D matrix generation and shrinking to achieve the best result. According to Table 7, our method achieves the best result for this large dataset, which is also reflected in the phylogenetic tree. This indicates that our method is well suited to sequence similarity identification. Ni et al. [25] mentioned that a larger descriptor keeps more information than a smaller one. Our solution achieves the top ranking among almost 80 algorithms, which demonstrates that our method is suitable for similarity identification on a large sequence dataset.

Phylogenetic analysis of 16S ribosomal DNA of 13 bacteria
We choose another dataset from Delibaş et al. [9], which consists of 16S Ribosomal DNA sequences of 13 bacteria with a description, an accession code to access the sequence from NCBI (URL in footnote 5), and the sequence length. Each sequence has a length of approximately 1500 bases. Some of the sequences are highly similar, and the rest are well separated. First, we generate a Newick tree using the MEGA7/X software with the following setup: ClustalW alignment with the default settings for pairwise and multiple alignments. We then use the UPGMA tree in MEGA to build the phylogenetic tree and the Newick tree string. Second, we generate a phylogenetic tree using our proposed method with k = 8 and S_r = 4, which are chosen from Experiment 4.2. The phylogenetic tree generated by our method is shown in Fig. 5. We then compare the Newick trees generated by MEGA and by our method, and the comparative result is presented in Table 8. Our method achieves 100% accuracy for this dataset, which is promising and ahead of the result of Delibaş et al. [9]. It also indicates that our method is effective for the sequence similarity identification of smaller sequences (e.g., 16S Ribosomal DNA of 13 bacteria).

Fig. 3 Phylogenetic tree of 25 fish genome sequences described in Table 1 using our proposed method with k = 8 and S_r = 4
Fig. 4 Phylogenetic tree of 8 Yersinia genome sequences described in Table 2 using our proposed method with k = 9 and S_r = 4

Phylogenetic analysis of 18 Eutherian mammals
We choose another existing dataset, 18 Eutherian Mammal, used by Delibaş et al. [9], Jin et al. [15], and others. The sequence length varies from approximately 16 to 17 thousand bases. For this dataset, we experiment in two steps, as for the 16S Ribosomal dataset in Experiment 4.6. The phylogenetic tree generated using k = 8 and S_r = 4 is shown in Fig. 6, and we then compare its Newick tree with that of MEGA7. The comparative result is presented in Table 9. Our method achieves 100% accuracy for this dataset too, which is promising and ahead of the result of Delibaş et al. [9] (19% more accurate). It also indicates that our method is effective and efficient for whole genome DNA sequence similarity identification.
Table note: Column 1 lists the methods, Columns 2 and 3 list the two most important parameters, and the last column gives the performance achieved by each method. Bold and the (*) sign indicate the best result.

Performance in terms of time and space
The effectiveness of a computer algorithm is measured by several parameters. Among them, time complexity is one of the most important [37], because it indicates how fast an algorithm can provide results. Several researchers, including Delibaş et al. [9], computed the complexity in terms of machine clock cycles. We instead discuss our time complexity in two steps. First, we express the time complexity using Θ and O notation. Let the sequence dataset consist of $N$ sequences, where each sequence has a maximum length of $L$, and let $k$ be the length of the k-mer string. The step-wise costs are presented in Table 10, and the total complexity is $O(N \times L \times 2^k)$. Second, we measure the running time using the tic and toc functions and compare the results with existing work in Table 11. Space is another parameter that expresses the quality of an algorithm. We measure memory consumption using the memory function and, for comparison, consider the HIV-1 and HEV datasets. From Table 11, we can see that in the case of the HEV dataset our method is 7,079 times faster than Ni et al. [25]. An almost similar result is obtained for the HEV dataset. Our method is also approximately 21 times faster than Delibaş et al. [9] for the 18 Eutherian mammal dataset. In terms of memory consumption, our method takes 12.85 MB less memory than Ni et al. [25] for the HIV-1 dataset. Therefore, among the compared methods, our method is the best fit when fast results and low memory consumption are required.
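The measurement procedure described above can be sketched in Python for readers who want to reproduce the idea. The `tic`, `toc` and `memory` functions named in the text appear to be MATLAB functions; the sketch below swaps in `time.perf_counter` and `tracemalloc` from the Python standard library, which measure wall-clock time and Python-level allocations only, so absolute numbers will not match Table 11. The counting routine is a generic CGR-style layout of a $2^k \times 2^k$ grid, not the authors' Algorithms 2 and 3, and the corner assignment in `CORNER` as well as the names `kmer_count_matrix` and `measure` are assumptions made for illustration.

```python
import time
import tracemalloc

# Assumed (row bit, column bit) for each base; the paper's own layout may differ.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_count_matrix(seq, k):
    """Count every k-mer of seq in a 2^k x 2^k grid using CGR-style addressing."""
    size = 1 << k
    matrix = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        row = col = 0
        for base in seq[i:i + k]:
            if base not in CORNER:       # skip k-mers containing N or other symbols
                break
            r, c = CORNER[base]
            row = (row << 1) | r         # each base contributes one bit per axis
            col = (col << 1) | c
        else:                            # runs only if no symbol was skipped
            matrix[row][col] += 1
    return matrix


def measure(func, *args):
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    sequence = "ACGT" * 400              # synthetic 1600-base sequence
    counts, seconds, peak_bytes = measure(kmer_count_matrix, sequence, 8)
    print(f"{seconds:.4f} s, peak {peak_bytes / 1e6:.2f} MB")
```

In this sketch, filling the grid costs on the order of $L \times k$ operations per sequence, while allocating and later scanning the grid costs $2^k \times 2^k$ cells, which is why the choice of $k$ dominates both time and memory.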

Impact of proposed shrinking algorithm
In our system, the most time- and space-consuming part is the count matrix, followed by the pairwise distance calculation. Let $F$ be the feature matrix of dimension $N \times P$, where $N$ is the number of sequences and $P$ is the length of the 1D descriptor, which is termed $D_s$ in the "Matrix shrinking and feature descriptor" section. The computational complexity of the pairwise distance calculation is then $\frac{N(N-1)}{2} \times 3P$ [23]. In our case, the computation therefore depends heavily on the value of $P$, as $N$ is very small compared to $P$. That is why we aim to reduce the size of $P$. In Table 12, we compare phylogenetic tree generation time using different shrink
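To make the dependence on $P$ concrete, the pairwise-distance cost $\frac{N(N-1)}{2} \times 3P$ stated above can be evaluated for a few descriptor lengths. The sketch below is only arithmetic on that formula. The value $N = 18$ matches the Eutherian mammal dataset, but both descriptor lengths are hypothetical: the unshrunk length assumes one entry per cell of a $2^8 \times 2^8$ matrix, and the 16-fold reduction is an illustrative figure, not a value reported in the paper. The function name `pairwise_distance_cost` is likewise hypothetical.

```python
def pairwise_distance_cost(n_sequences, descriptor_length):
    """Operation count N(N-1)/2 * 3P for all pairwise distances."""
    pairs = n_sequences * (n_sequences - 1) // 2   # number of unordered pairs
    return pairs * 3 * descriptor_length


if __name__ == "__main__":
    n = 18                          # sequences in the 18 Eutherian mammal dataset
    full_length = 4 ** 8            # hypothetical: every cell of a 2^8 x 2^8 matrix
    shrunk_length = full_length // 16   # hypothetical 16-fold smaller descriptor
    full = pairwise_distance_cost(n, full_length)
    shrunk = pairwise_distance_cost(n, shrunk_length)
    print(full, shrunk, full // shrunk)
```

Because the cost is linear in $P$ and only quadratic in the small number $N$, any reduction of the descriptor length carries over to the distance step by the same factor.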

Conclusion
In this research, we develop a method for sequence similarity measurement of any sequence dataset that dynamically selects $k$ for the k-mer, generates a 2D k-mer count matrix with appropriate shrinking, and then applies the best combination of PD and phylogenetic tree generation methods. Our experiments show that the dynamic selection of $k$ is essential to achieving the best result. Based on experiments on benchmark datasets, comparison with existing studies, phylogenetic analysis, and RF distances from reference trees, we conclude that our 2D k-mer count matrix generation is fast, accurate, effective, and robust for DNA sequence analysis. Our matrix shrinking, efficient position calculation, and selection of the optimal combination of PD and phylogenetic tree generation methods achieve the best performance in terms of time and space. Hence, we conclude that our method is novel, robust, fast, and accurate for sequence similarity analysis, and that it can be used with a good level of reliability. The contributions of our method are as follows:
- We achieve a top rank score in two benchmark datasets (Fish and Yersinia) among two hundred methods.
- We achieve 100% accuracy for two other datasets (18 Eutherian, 16S Ribosomal), which is clearly better than other existing methods.
- Our proposed method is faster than existing AF-based methods as well as AB algorithms.
- The proposed system consumes several times less memory than existing methods.
- Our method dynamically chooses the value of $k$ to generate the 2D k-mer matrix using Algorithm 1.
- It takes less time to generate the 2D k-mer matrix in comparison to others because of our Algorithms 2 and 3.
- Our system automatically shrinks the size of the feature vector using Algorithm 4, resulting in higher accuracy and lower time complexity.

Table 10 Step-wise time complexity calculation for our proposed method

| Step | Method | Time complexity |
| --- | --- | --- |
| Step 1 | Dynamic k-mer selection (Algorithm 1) | $N + 2$ |
| Step 2 | 2D k-mer matrix generation | $L \times k \times 6$ |
| Step 3 | Matrix shrinking | $2^k \times 2^k$ |
| Step 4 | 1D feature descriptor | $2^k$ |
| Step 5 | Distance and phylogenetic tree | $2 \times N$ |
| Final complexity | | $O(N \times L \times 2^k)$ |
Although our method achieves strong performance on six datasets, it should be evaluated further. In the future, researchers can use more benchmark datasets, including COVID-19 and others. Moreover, time and space consumption are still a major concern. Finally, researchers can investigate deep learning-based text processing techniques and rough set algorithms for improved performance.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.