A new fast technique for pattern matching in biological sequences

Pattern matching is essential at numerous phases of computational biology. It enables users to search for specific DNA sequences or subsequences in a database, and many of these rapidly expanding biological databases are updated on a regular basis. As biological data grows exponentially, researchers are striving to improve solutions in numerous areas of computational bioinformatics, and real-world applications need faster algorithms with a low error rate. This study therefore offers two pattern matching algorithms, EFLPM and EPAPM, that were created to speed up pattern searches in DNA sequences. The proposed strategies improve performance by using word-level processing rather than the character-level processing used in previous studies. In terms of time cost, the proposed algorithms improve performance by leveraging word-level processing with large pattern sizes. The experimental results show that the proposed methods are faster than other algorithms for both short and long patterns. Specifically, the EFLPM algorithm is 54% faster than the FLPM method, while the EPAPM algorithm is 39% faster than the PAPM method.


Introduction
In pattern matching, a sequence, text, or database is scanned to discover the positions at which a pattern occurs [1,2]. This problem must be addressed for a variety of reasons, including its applications in image and signal processing, search engines, text processing, information retrieval, question-answer systems, and chemistry [3][4][5][6].
The pattern matching problem is prevalent in several areas of computational bioinformatics, such as basic local alignment search, biomarker identification, sequence alignment, homologous series recognition, and proteogenomic mapping. In these fields, it is necessary to detect the positions of patterns in databases, as well as those of nucleotides and amino acids [7][8][9]. Gene analysis and DNA sequences can be used to investigate suspected illness or anomaly diagnoses in biotechnology, forensics, medicine, and agriculture research. DNA sequence analysis can be used to compare a gene to comparable genes in the same or different animals, as well as to predict its function. In another application, the function of a newly found DNA sequence may be predicted by comparing it to existing DNA sequences. This method has been employed in several medical investigations and applications.
This study addresses the inefficiency of exact pattern matching, which entails detecting all instances of a pattern in a text. The goal is to improve the efficiency of the FLPM and PAPM algorithms in order to reduce the computational cost that has affected earlier studies [29]. Unlike the methods in the literature [20-28, 30, 31], the proposed methods do not have separate preprocessing and matching steps. The matching procedure detects the intervals of the text that are similar to the pattern. In this work, the preprocessing stage of the FLPM and PAPM algorithms is modified to reduce comparison time. In the new approach, the windows are counted and then compared with the pattern at the matching stage.

Motivations of using pattern matching
Pattern matching is a technique for determining whether or not the elements of a pattern are present in an observed string sequence. Unlike pattern recognition, the match must usually be exact. Pattern matching tasks often include outputting the locations of a pattern inside a string sequence, outputting some component of the matched pattern, and substituting the matched pattern with another string sequence (i.e., search and replace). The pattern matching idea is used in a variety of applications. Figure 1 depicts these uses.
In our pattern matching investigations, we focused on DNA sequences. Handling this kind of data requires a better search technique, because finding the matching patterns is what produces the correct and most appropriate result. Many algorithms have been used to detect matching patterns.

Algorithm's notation
The matching algorithms use the notation shown below.
The rest of the paper is organized as follows. Section 2 reviews related work, and Sect. 3 explains the problem. The proposed algorithms are described in Sect. 4. Section 5 compares the performance and effectiveness of the proposed pattern matching algorithms against those of other algorithms. Finally, Sect. 6 concludes the research and outlines future work.

Related Work
Several pattern matching algorithms have been developed in order to reduce the number of comparisons done during search operations. To reduce the number of comparisons, the matching process is usually split into two parts, a preprocessing phase and a searching phase. During the preprocessing phase, the distance (shift value) by which the pattern window will move is determined. During the searching phase, this shift value is used to find the pattern in the text with as few character comparisons as possible. In this section, we present two earlier pattern matching algorithms, FLPM and PAPM, which our proposed methodology improves.
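To make the two-phase idea concrete, the following is a minimal sketch of a shift-based search in the style of Horspool, one of the baseline algorithms compared later in this paper. It is an illustration only and not the implementation used in the experiments; the function name horspool_search and the dictionary-based shift table are assumptions made for this example.

```python
from typing import Dict, List


def horspool_search(text: str, pattern: str) -> List[int]:
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Preprocessing phase: for every character of the pattern except the
    # last one, store how far the window may move when that character is
    # aligned with the end of the window. Later (rightmost) occurrences
    # overwrite earlier ones, which yields the smallest safe shift.
    shift: Dict[str, int] = {}
    for idx, ch in enumerate(pattern[:-1]):
        shift[ch] = m - 1 - idx

    # Searching phase: compare the window with the pattern, then move the
    # window by the precomputed shift of the window's last character.
    matches: List[int] = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            matches.append(i)
        i += shift.get(text[i + m - 1], m)  # unseen characters shift by m
    return matches
```

For example, horspool_search("ACGTACGTACGTA", "ACGTACG") returns [0, 4]; the window jumps from index 0 to index 4 in one step because of the stored shift for G.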

First-Last-pattern matching (FLPM) algorithm
FLPM [29] is an upgrade of Divide and Conquer Pattern Matching (DCPM) that consists of two stages, pre-processing and matching. Comparisons are the basis of FLPM. In the pre-processing stage, the text is scanned to identify windows that will be used later in the matching stage. The search covers the range t[0…n−m], where m is the pattern length and n is the text length. The algorithm compares the first character of the pattern with the first character of the current section of the text, and the last character of the pattern with the last character of that section. If both pairs of characters are identical, the number of windows is increased. This comparison is repeated throughout the pre-processing stage. During the matching stage, the algorithm scans the windows to find all instances of the pattern inside the text.
Fig. 1 Applications of pattern matching [32,33]
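The two FLPM stages described above can be sketched as follows. This is a minimal illustration written from the description in this section, not the reference implementation of [29]; the function names flpm_preprocess and flpm_match are hypothetical.

```python
from typing import List


def flpm_preprocess(text: str, pattern: str) -> List[int]:
    """Stage 1: collect start indexes of windows whose first and last
    characters equal the first and last characters of the pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    first, last = pattern[0], pattern[-1]
    return [
        i
        for i in range(n - m + 1)  # the range t[0 .. n-m]
        if text[i] == first and text[i + m - 1] == last
    ]


def flpm_match(text: str, pattern: str, windows: List[int]) -> List[int]:
    """Stage 2: compare each stored window with the whole pattern."""
    m = len(pattern)
    return [i for i in windows if text[i:i + m] == pattern]
```

With text "ATGACG" and pattern "ACG", the first stage returns the windows [0, 3] (both start with A and end with G), and the second stage keeps only [3], since "ATG" differs from the pattern in its middle character.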

Processor-aware-pattern-matching (PAPM) algorithm
PAPM [29] differs from the FLPM algorithm in how the characters of pattern p and text t are compared. FLPM uses a character-based pattern matching method, whereas PAPM compares words composed of several characters, taking advantage of the word size of the processor. The PAPM algorithm compares two words at a time, with word_len = b/8 determining the length of each word, where b is the bit width of the processor (either 32 or 64). If b = 32, it compares two four-character words. The pre-processing stage searches the interval t[0…n−m] for the first word of the pattern, p[0…word_len−1], and the start index of each occurrence is stored in the windows array. Because PAPM examines more characters (a whole word) per comparison when it searches for windows, it finds fewer windows, which reduces the time needed for the next step.
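The word-based pre-processing stage can be sketched as below. Python string slices stand in for processor words here, so the sketch shows the logic rather than the register-level speed-up. The function name papm_preprocess, the bits parameter, and the handling of patterns shorter than one word are assumptions of this example, not details taken from [29].

```python
from typing import List


def papm_preprocess(text: str, pattern: str, bits: int = 32) -> List[int]:
    """Collect start indexes in t[0 .. n-m] where the first word of the
    pattern, p[0 .. word_len-1], occurs in the text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # word_len = b / 8 characters; capped at m (assumption) so that very
    # short patterns are still handled.
    word_len = min(bits // 8, m)
    first_word = pattern[:word_len]
    return [
        i
        for i in range(n - m + 1)
        if text[i:i + word_len] == first_word
    ]
```

For text "ACGTTTGACGTACG" and pattern "ACGTACG" with bits=32, the first word is "ACGT" and the stored windows are [0, 7]; only the window at index 7 is a full match, which the matching stage would then confirm.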

Research problem and the proposed solutions
The volume of biological data has expanded significantly in recent years, and it must be examined in a reasonable amount of time. This problem arises in molecular biology because amino acid or nucleotide sequences are frequently used to approximate biological molecules. Another example is the basic knowledge of species' DNA sequences and the difficulty of retrieving this information by pattern matching. Moreover, identifying potential irregularities or mistakes in a DNA sequence frequently necessitates DNA sequence analysis. Pattern matching is also useful in domains such as phylogenetics and evolutionary biology. In these applications, specific DNA subsequences are retrieved from the genomic data of various species to better understand their relatedness, ancestry, and origin. In this context, a pattern matching algorithm must be able to search datasets spanning gigabytes to terabytes or more, as well as complete genomes containing 3 billion base pairs [14]. Because DNA sequences are extremely long, the time spent matching the pattern is regarded as the most essential factor.
Consider the text t with length n over the alphabet ∑ and the pattern p with length m. A string is defined as a sequence of zero or more symbols of the alphabet ∑, and ∑* represents the set of all possible strings over ∑. If x = abc, where a, b, and c ∈ ∑*, then b is a substring of string x (and a and c are a prefix and a suffix of x, respectively). Pattern p is stored in a finite array p[0…m−1], in which m > 0. The (i + 1)-st symbol of p is represented by p[i], in which 0 ≤ i < m. Furthermore, p[i…j] denotes the substring of p from the (i + 1)-st symbol to the (j + 1)-st symbol of p, where 0 ≤ i ≤ j < m.
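As a small illustration of this notation, the sketch below maps the inclusive range p[i…j] onto Python slicing and states the exact matching problem as a brute-force baseline. The helper names substring and occurrences are hypothetical and are used only in this example.

```python
from typing import List


def substring(p: str, i: int, j: int) -> str:
    """Return p[i…j] in the inclusive notation used in the text."""
    if not 0 <= i <= j < len(p):
        raise IndexError("indexes must satisfy 0 <= i <= j < m")
    return p[i:j + 1]  # Python slices exclude the end index, hence j + 1


def occurrences(t: str, p: str) -> List[int]:
    """Brute-force statement of exact matching: every index i in
    0 .. n-m such that t[i…i+m-1] equals p."""
    n, m = len(t), len(p)
    return [i for i in range(n - m + 1) if t[i:i + m] == p]
```

Here substring("ACGTA", 1, 3) gives "CGT", i.e., the second to the fourth symbol, and occurrences("ACGTACGTACGTA", "ACGTACG") gives [0, 4].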

In the proposed pattern matching methods, the text is scanned to find windows of length m. In the matching step, the algorithms compare the characters of the pattern one by one with those in the window to determine whether the whole pattern occurs there. After a complete match or a mismatch, the remaining windows are checked against the pattern in the same way.

Methodology
This section describes the Enhance-First-Last Pattern Matching (EFLPM) algorithm and the Enhance-Processor-Aware Pattern Matching (EPAPM) algorithm. EFLPM is an enhancement of FLPM that combines the pre-processing and matching stages of FLPM into a single phase to reduce running time.

The proposed EFLPM algorithm
Because FLPM is based on comparisons, its pre-processing stage scans the text to identify the windows that will be used later in the matching stage. During the pre-processing stage, the windows of pattern size whose first and last characters match the first and last characters of the pattern are extracted. If the first and last characters match, the window is added to the array of windows; if they do not, the pattern is shifted by one character and the process is repeated. This procedure is repeated throughout the text. The matching stage then compares the extracted windows with the pattern once more. The flowchart of the proposed EFLPM algorithm is shown in Fig. 2.
The steps of the proposed EFLPM algorithm can be summarized as follows:
Step 1: Read the DNA sequence dataset as a FASTA file.
Step 2: Initialize the while-loop counter to 0. The loop continues until the counter reaches n-m.
Step 3: Compare the first and last characters of the pattern with the first and last characters of the current section of the text.
Step 4: If the comparison yields a false result, the pattern does not occur in this section of the text, since the first and last characters did not match. If they match, the pattern may occur in this section of the text, which is called the window. Therefore, conduct the matching process immediately in this window.
Step 5: If the pattern and this section of the text are the same, the complete pattern occurs in the text. In that case, increase the number of matches by one and return to the loop to process the rest of the text.
Step 6: When the loop ends, return the start index of every instance of pattern p in text t.
In the worked example, the algorithm searches for the first and last characters of the pattern, p[0] and p[6], in the text. At the beginning of Algorithm 1, p[0] and p[6] are aligned to t[0] and t[6], respectively. As a result, the window index array contains the start index of the first window, i.e., 0. Following this example, the algorithm identifies eleven other windows. Whenever the first and last characters match, the next step checks the pattern against this window, increases the match counter if a match occurs, and stores the first index of the window in match_index.
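The steps above can be sketched as a single loop in which a window is verified as soon as its first and last characters match, so no window array has to be stored and revisited. This is a minimal sketch written from the step list, not the authors' pseudocode; the function name eflpm_search is hypothetical, reading the FASTA file (Step 1) is left out, and match_index follows the name used in the text.

```python
from typing import List


def eflpm_search(text: str, pattern: str) -> List[int]:
    """Single-phase first/last-character search over t[0 .. n-m]."""
    n, m = len(text), len(pattern)
    match_index: List[int] = []
    if m == 0 or m > n:
        return match_index
    first, last = pattern[0], pattern[-1]

    count = 0  # Step 2: loop counter, runs until it reaches n - m
    while count <= n - m:
        # Steps 3 and 4: cheap filter on the first and last characters.
        if text[count] == first and text[count + m - 1] == last:
            # Step 5: verify the inner characters immediately.
            j = 1
            while j < m - 1 and text[count + j] == pattern[j]:
                j += 1
            if j >= m - 1:
                match_index.append(count)
        count += 1
    return match_index  # Step 6: start indexes of all occurrences
```

For example, eflpm_search("ATGACG", "ACG") returns [3]: the window at index 0 passes the first/last filter but fails on its middle character, and it is rejected in the same pass rather than in a later matching stage.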

Proposed EPAPM algorithm
The Enhance-Processor-Aware Pattern Matching (EPAPM) algorithm, which is based on PAPM, is described in this section. The way PAPM compares the characters of pattern p and text t differs from the FLPM method: PAPM compares words made up of several characters, whereas FLPM compares single characters. PAPM compares two words at the same time using the processing capacity of the CPU. The registers of a b-bit processor are b bits long, and the processor can compare the data in two registers during each execution cycle. Since each byte (or character) contains eight bits, the number of processable bytes (the word length) for this processor is computed as word_len = b/8. This means that the processor can compare one word to another each time it reads its registers. A 64-bit CPU, for example, can compare two words of eight characters each.
We apply the same strategy that was used for FLPM: to reduce the time cost, the pre-processing and matching stages are merged into a single process that does the same job in less time. The steps of the EPAPM algorithm can be summarized as follows:
Step 1: Read the DNA sequence dataset as a FASTA file.
Step 2: Initialize the while-loop counter to 0, set word_len to b/8 (described in Section 4.2), and set k to m mod word_len.
Step 3: Determine the start index for the word comparison at the start of this phase. Setting this start index ensures that the method runs successfully even if the lengths of the pattern and windows are not integer multiples of the word length.
Step 4: Begin the while loop with count = 0 and continue until the counter reaches n-m.
Step 5: Compare the two words based on word_len (a word can contain 4 or 8 characters).
Step 6: If the two words match, check all the remaining words of the pattern against the text.
Step 7: If all words of the pattern are matched in the text, increase the number of matches by one and return to the loop to continue with the rest of the text.
Step 8: When the loop is finished, return the start index of every instance of pattern p in text t.
Fig. 3 The pseudocode of the proposed algorithm EFLPM
Figs. 6 and 7 show the flowchart and pseudocode of the EPAPM algorithm. Figure 8 gives an example of the EPAPM algorithm running on a 32-bit processor. In this example, the first word of pattern p (its first four characters) is searched for in text t. The window_index array is composed of the three start indexes of the found windows, i.e., 25, 40 and 47. By comparison, the EFLPM method identifies 12 start indexes as potential intervals or windows for the same example, so EPAPM decreases the number of recognised windows. Because the remainder of the pattern length divided by the word length is 3, the start index for matching is also 3. As a result, the second word of the pattern, which is compared with the second word of each window, is CGTA. After this phase is completed, only one of the windows (i.e., t[25…31]) matches the pattern (Figs. 9 and 10).
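The EPAPM steps can be sketched in the same single-loop form. Python string slices again stand in for processor words, so the sketch illustrates the control flow and the role of k = m mod word_len rather than the hardware-level comparison. The function name epapm_search and the bits parameter are hypothetical. Two choices below are assumptions that the text does not spell out: word_len is capped at m for patterns shorter than one word, and the start index is set to word_len when k is 0.

```python
from typing import List


def epapm_search(text: str, pattern: str, bits: int = 64) -> List[int]:
    """Single-phase word-based search over t[0 .. n-m]."""
    n, m = len(text), len(pattern)
    matches: List[int] = []
    if m == 0 or m > n:
        return matches

    word_len = min(bits // 8, m)  # Step 2 (cap at m is an assumption)
    k = m % word_len
    # Step 3: the remaining words start at k, so the second word may
    # overlap the first when m is not a multiple of word_len.
    start = k if k > 0 else word_len  # k == 0 case is an assumption
    first_word = pattern[:word_len]

    count = 0  # Step 4
    while count <= n - m:
        # Step 5: compare the first word of the pattern with the text.
        if text[count:count + word_len] == first_word:
            # Step 6: compare the remaining words, one word at a time.
            j = start
            while (j < m and
                   text[count + j:count + j + word_len]
                   == pattern[j:j + word_len]):
                j += word_len
            if j >= m:  # Step 7: every word matched
                matches.append(count)
        count += 1
    return matches  # Step 8
```

With a 7-character pattern and bits=32, word_len is 4 and k is 3, so the second word covers p[3…6] and overlaps the first word by one character, as in the worked example above. For instance, epapm_search("ACGTTTGACGTACG", "ACGTACG", bits=32) returns [7].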

Experimental results
In this part, the performance of the suggested algorithms (EFLPM and EPAPM) is compared to that of the Boyer-Moore (BM), Horspool, Karp-Rabin, Zhu-Takaoka, d-BM, KMP, FLPM, and PAPM algorithms. The computer environment used to execute the simulated algorithms has the following specifications: Windows 10 Home 64 bit, 8 GB RAM, Intel(R) Core(TM) i7-7500U CPU 2.70 GHz. The word length for the EPAPM and EFLPM algorithms was taken to be eight bytes because a 64-bit machine was used. The Python programming language was used to run the simulation. Each experiment involved searching the reference for ten patterns and reporting the average of the results. Because all tests used the HRG dataset, which is published in [43] and in which each file has over 12 million characters, its time overhead was eliminated throughout the simulation. The rest of this section gives the findings of the pattern matching algorithm comparisons with DNA sequences larger than 12 million characters and varied pattern sizes.
Table 3 shows the results of the performance evaluation of the simulated algorithms in terms of time cost, comparing the time taken to complete the matching process by the methods from previous studies with that of the proposed algorithms. EFLPM and EPAPM are effective pattern matching algorithms that reduce time when compared to other techniques. This demonstrates that the pattern matching time optimization strategy we utilised was successful. According to the table, EFLPM is the most efficient method in terms of minimising the time required for various pattern sizes, and it outperforms the other algorithms, particularly FLPM. Furthermore, EPAPM yielded better outcomes than PAPM, which includes separate pre-processing and matching phases. This shows that both proposed algorithms are superior for all pattern sizes.
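The measurement procedure described above (search for several patterns and report the average) might be set up along the following lines. This is only a sketch of one possible harness and not the script used in the study; the names find_all and average_search_time are hypothetical, and find_all is a stand-in searcher built on Python's built-in str.find so that the example is self-contained.

```python
import time
from typing import Callable, List, Sequence


def find_all(text: str, pattern: str) -> List[int]:
    """Stand-in searcher: all (possibly overlapping) occurrences."""
    positions: List[int] = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def average_search_time(
    search: Callable[[str, str], List[int]],
    text: str,
    patterns: Sequence[str],
) -> float:
    """Average wall-clock seconds per pattern for one search function.
    The text is passed in already loaded, so file reading is not timed."""
    total = 0.0
    for p in patterns:
        started = time.perf_counter()
        search(text, p)
        total += time.perf_counter() - started
    return total / len(patterns)
```

Any of the search functions under comparison can be passed as the search argument, provided it takes the text and the pattern and returns the list of start indexes.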
Because the proposed algorithms are fast for both short and long patterns, they remove the need to match large patterns with some algorithms and small patterns with others. This is especially advantageous given the expansion of biological data, which is constantly growing. All approaches were 100% accurate, and none of them failed to find the pattern. As a result, the primary criterion of the comparison was time cost.
Although time is an important part of the pattern matching process, we noticed that the algorithms used in earlier work focus on efficiency without considering the time spent in the process. The speed of pattern finding in biological data was the emphasis of this work. As shown in Table 3, the speed of pattern detection increases dramatically compared to the prior algorithms. When this ratio was calculated, it was found that EFLPM had a 54% faster pattern detection speed than FLPM and EPAPM had a 39% faster pattern detection speed than PAPM. As a result, the proposed algorithms are faster and more efficient than traditional algorithms. Figure 11 shows the time cost of the pattern matching algorithms for the smallest pattern, containing only 4 characters, searched in a DNA sequence of more than 12 million characters; EFLPM and EPAPM are the least time-consuming algorithms. Figure 12 shows the time cost of the algorithms for a pattern of 10,000 characters searched in a DNA sequence of more than 12 million characters; EFLPM is again the least time-consuming algorithm. Figure 13 shows the time cost of the algorithms for a pattern of 1,000,000 characters searched in a DNA sequence of more than 12 million characters; EFLPM and EPAPM are also the least time-consuming algorithms for this large pattern size.
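The text does not state how the percentage was computed from the measured times. The sketch below shows two common readings of "X% faster" so that the difference between them is explicit; both function names are hypothetical, and the timings in the usage note are made-up values chosen only to keep the arithmetic simple, not results from Table 3.

```python
def percent_time_reduction(t_baseline: float, t_improved: float) -> float:
    """Reading 1: share of the baseline running time that was saved."""
    if t_baseline <= 0:
        raise ValueError("baseline time must be positive")
    return (t_baseline - t_improved) / t_baseline * 100.0


def percent_speed_gain(t_baseline: float, t_improved: float) -> float:
    """Reading 2: increase in speed (work per unit time)."""
    if t_improved <= 0:
        raise ValueError("improved time must be positive")
    return (t_baseline / t_improved - 1.0) * 100.0
```

With a hypothetical baseline of 2.0 s and an improved time of 1.0 s, the first reading gives 50.0 and the second gives 100.0, so the reading that is used should be stated alongside any reported percentage.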
Finally, Fig. 14 shows the overall time cost of all algorithms. As seen in this figure, EPAPM and EFLPM use word-level processing to significantly reduce the time required to complete pattern matching.

Discussion
Real-world problems need a quick algorithm with minimum error. Pattern matching algorithms are used in a wide range of applications, including pattern recognition, information retrieval, text processing, and DNA sequence analysis, where they help to find the correct and appropriate result. Many algorithms are available for pattern matching; we focused on DNA sequences. We found that the existing algorithms focus on efficiency without looking at the time used in the pattern matching process, even though the time taken is an influential factor in matching. In addition, these algorithms consist of two stages, which increases the time spent. Because time is now such an influential factor, we have worked to reduce the time spent while preserving efficiency. This paper introduces two improved pattern matching algorithms specifically formulated to speed up searches on large DNA sequences. In future research, we plan to create an efficient approach on real parallel processors to reduce the number of comparisons and attempts, and to employ various techniques, such as machine learning and deep learning, to reduce the time and the number of comparisons.
The effect of different factors, such as text size, RAM, and IDE, on string matching algorithms can be determined by designing a factorial model based on factorial design. This trend has already started with a new algorithm for DNA sequence matching that uses an MPI technique. Another future direction is to implement string matching algorithms on GPUs and FPGAs.
Arabic is one of the most widely spoken languages in the world. In Arabic, connected and unconnected words exist, which take considerable bytes and processing time. The development of multilingual exact matching algorithms with suitable encoding techniques is a promising and interesting direction for future work. The analysis of the heap memory requirements of existing string matching algorithms during runtime is another interesting topic.

Conclusion and future work
In this paper, we presented two fast pattern matching algorithms, EFLPM and EPAPM. The EFLPM approach is a character-based pattern matching method, similar to previous studies, whereas the EPAPM method is a word-based method. According to the outcomes of the experiments in this study, the proposed algorithms outperform the other simulated algorithms in terms of time cost: EFLPM is 54% faster than FLPM, and EPAPM is 39% faster than PAPM. The proposed algorithms are therefore well suited to pattern matching in biological sequences. This improvement is mostly due to the reduction in the number of detected windows and the consolidation of the pre-processing and matching steps into a single step. Future research will focus on a parallel version of the current procedures, as well as the use of deep learning techniques in this field. Furthermore, while this study focuses on algorithms for exact pattern matching, future studies could focus on methods that allow approximate pattern matching.