Suffix Type String Matching Algorithms Based on Multi-windows and Integer Comparison

Fan, Hongbo; Shi, Shupeng; Zhang, Jing; Dong, Li

doi:10.1007/978-3-319-29814-6_35

Conference paper
First Online: 05 March 2016

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 9543)

Abstract

In this paper, three classic suffix-type algorithms, QS, Tuned BM and BMHq, are improved by reducing the average cost of their basic operations. First, the multi-window method is used so that the jump-distance calculations of different windows can run in parallel and be pipelined. Second, the comparison unit is widened from a character to an integer, which reduces both the total number and the average cost of comparisons. For BMHq in particular, the jump distance is increased by a good-prefix rule, and the operations needed to obtain the jump distance are simplified by an unaligned integer read. The three resulting algorithms are named QSMI, TBMMI and BMHqMI. They are faster than other known algorithms in many cases.

1 Introduction

String matching is a performance bottleneck and a key issue in many important fields, so the design of exact single-pattern matching algorithms is of considerable significance. High-performance matching is in particularly strong demand in the fields we focus on, real-time information processing and security.

In the string \( S = s_{0} s_{1} \ldots s_{m - 1} \), for \( 0 < k \le m \), we denote the prefix and the suffix of S of length k by \( pref(S,k) \) and \( suff(S,k) \), and a factor of S of length k by \( fac(S,k) \). SCW denotes the text in the sliding window. All algorithms in this paper are exact single-pattern matching algorithms: given an alphabet Σ (|Σ| = σ, Σ* is the closure of Σ), a text T = \( t_{0} t_{1} \ldots t_{n - 1} \) of length n and a pattern \( P = p_{0} p_{1} \ldots p_{m - 1} \) of length m, with P, T ∈ Σ*, the task is to find, among all possible sliding windows, those for which \( P[i] = SCW[i] \) for \( \forall i \in [0 \ldots m - 1] \). Algorithms are described in C/C++.

This paper improves three classical suffix matching algorithms: Quick Search [1], Tuned BM [2] and BMHq [3]. We add the multi-window method [4] to them and introduce an integer comparison method. The result is three series of algorithms, named QSMI, TBMMI and BMHqMI, which are very fast for short patterns.

2 Accelerating Method: Multi-window and Integer Comparison

The multi-window method [4] (shown in Fig. 1) divides the text equally into k/2 areas when k windows are used (k is even). Each area has two windows, which start at the two ends of the area and move toward the middle until they overlap; the windows are matched in turn. It is a general acceleration method for string matching.
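The two-window case (k = 2, a single area) can be sketched as follows. This is a minimal illustration under our own assumptions, not the multi-window implementation of [4]: the function name `two_window_search`, the Quick Search style shift for the left window, and the mirrored shift table `bwd` for the right window are all choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Two windows over one area: `l` moves left-to-right, `r` moves right-to-left.
// The two shift computations are independent, so a CPU can overlap them.
std::vector<std::size_t> two_window_search(const std::string& text, const std::string& pat) {
    std::vector<std::size_t> hits;
    const auto n = static_cast<std::ptrdiff_t>(text.size());
    const auto m = static_cast<std::ptrdiff_t>(pat.size());
    if (m == 0 || n < m) return hits;

    std::ptrdiff_t fwd[256], bwd[256];
    for (int c = 0; c < 256; ++c) fwd[c] = bwd[c] = m + 1;
    for (std::ptrdiff_t j = 0; j < m; ++j)       // rightmost occurrence wins
        fwd[static_cast<unsigned char>(pat[j])] = m - j;
    for (std::ptrdiff_t j = m - 1; j >= 0; --j)  // leftmost occurrence wins
        bwd[static_cast<unsigned char>(pat[j])] = j + 1;

    auto check = [&](std::ptrdiff_t pos) {
        assert(pos >= 0 && pos + m <= n);
        if (std::memcmp(text.data() + pos, pat.data(), static_cast<std::size_t>(m)) == 0)
            hits.push_back(static_cast<std::size_t>(pos));
    };

    std::ptrdiff_t l = 0, r = n - m;
    while (l <= r) {                 // stop once the windows have crossed
        check(l);
        if (r != l) check(r);        // avoid reporting the same position twice
        // Left window: shift on the character just after the window.
        l += (l + m < n) ? fwd[static_cast<unsigned char>(text[l + m])] : m + 1;
        // Right window: shift on the character just before the window.
        r -= (r > 0) ? bwd[static_cast<unsigned char>(text[r - 1])] : m + 1;
    }
    std::sort(hits.begin(), hits.end());  // hits arrive from both ends
    return hits;
}
```

For example, `two_window_search("abababab", "abab")` yields the positions 0, 2 and 4. Extending the idea to k windows would mean running one such pair per area.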

Suffix matching performs many comparisons. Let punishment denote the delay caused by a branch misprediction. The average branch cost of a character comparison is about \( 1 - \sigma^{ - 1} + \sigma^{ - 1} \cdot punishment \), e.g., 10.5 ticks for a DNA sequence on Prescott. Instead, \( pref(SCW,w) \) can be read, unaligned, into an integer and compared with the integer holding \( pref(P,w) \); only when the two are equal are further comparisons needed. One integer comparison is equivalent to \( w \) character comparisons, and the average branch cost is reduced to \( 1 - \sigma^{ - w} + \sigma^{ - w} \cdot punishment \). Comparing uint16_t/uint32_t values on Prescott for a DNA sequence reduces the average branch cost to 3.27/1.15 ticks.
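The cost formula and the integer comparison can be sketched in a few lines. The function names are hypothetical, the penalty passed to `expected_branch_cost` is whatever the caller assumes for the target CPU, and `memcpy` is used here as a portable way to express the unaligned read (the cast-based read used elsewhere in this paper is the non-portable equivalent).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Average branch cost from the text: 1 - sigma^-w + sigma^-w * punishment,
// where sigma^-w is the chance that w random characters all match.
double expected_branch_cost(double sigma, int w, double punishment) {
    assert(sigma > 1.0 && w >= 1);
    const double p_equal = std::pow(sigma, -w);
    return 1.0 - p_equal + p_equal * punishment;
}

// Unaligned 32-bit read; compilers usually turn this memcpy into one load.
inline std::uint32_t load_u32(const char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// True if the first four bytes of the window equal the first four bytes of
// the pattern. Both pointers must have at least 4 readable bytes.
inline bool prefix4_equal(const char* window, const char* pattern) {
    return load_u32(window) == load_u32(pattern);
}
```

With σ = 4, the probability that the comparison succeeds (and the expensive path is taken) falls from 1/4 for one character to 1/256 for a four-character integer, which is where the reduction in average cost comes from.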

3 Improved Algorithms Based on QS, Tuned BM and BMHq

Introducing the above methods into QS [1] gives a new algorithm called QSMI_wkXc, where k is the number of windows and X is the integer type used for comparison: S for short/uint16_t, I for int/uint32_t, and L for long long/uint64_t. The code of QSMI_w4Ic is listed as Algorithm 1.
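To make the idea concrete, the following is a single-window sketch of Quick Search with a uint32_t prefix comparison. It is an illustration under our own assumptions rather than the authors' QSMI_w4Ic: it uses one window instead of four, the name `qs_int_search` is hypothetical, and patterns shorter than four characters fall back to a plain `memcmp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

std::vector<std::size_t> qs_int_search(const std::string& text, const std::string& pat) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || n < m) return hits;

    // Quick Search bad-character table, indexed by the character that
    // follows the window: m + 1 if it is absent from the pattern, otherwise
    // the distance that aligns its rightmost occurrence.
    std::size_t shift[256];
    for (std::size_t c = 0; c < 256; ++c) shift[c] = m + 1;
    for (std::size_t j = 0; j < m; ++j)
        shift[static_cast<unsigned char>(pat[j])] = m - j;

    std::uint32_t pat_prefix = 0;
    if (m >= 4) std::memcpy(&pat_prefix, pat.data(), 4);

    auto matches_at = [&](std::size_t i) {
        assert(i + m <= n);
        if (m < 4) return std::memcmp(text.data() + i, pat.data(), m) == 0;
        std::uint32_t win_prefix;
        std::memcpy(&win_prefix, text.data() + i, 4);  // unaligned integer read
        if (win_prefix != pat_prefix) return false;    // one branch rejects 4 characters
        return std::memcmp(text.data() + i + 4, pat.data() + 4, m - 4) == 0;
    };

    std::size_t i = 0;
    while (i + m <= n) {
        if (matches_at(i)) hits.push_back(i);
        if (i + m >= n) break;  // no character after the last window
        i += shift[static_cast<unsigned char>(text[i + m])];
    }
    return hits;
}
```

Calling `qs_int_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")` reports the single occurrence at position 5.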

Introducing the continuous jump method of Tuned BM into QSMI gives TBMMI. First, integer comparison determines whether a match occurs in each window. Then the Quick Search bad-character jump is applied once, followed by several Horspool bad-character jumps; TBMMI uses one QS jump and two Horspool jumps. TBMMI_w4Ic, which uses only the bad-character jump table of QS, is shown as Algorithm 2.
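The continuous jump idea can be illustrated with the skip loop of plain Tuned BM. The sketch below is not TBMMI: it has one window, no integer comparison and no QS jump, and the name `tuned_bm_search` is hypothetical. It relies on two standard devices: the Horspool shift of the last pattern character is set to 0 so that repeated jumps become no-ops once that character is found, and m copies of that character are appended to a copy of the text as a sentinel so the unrolled loop needs no bounds checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

std::vector<std::size_t> tuned_bm_search(const std::string& text, const std::string& pat) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || n < m) return hits;

    const unsigned char last = static_cast<unsigned char>(pat[m - 1]);
    std::size_t d[256];
    for (std::size_t c = 0; c < 256; ++c) d[c] = m;
    for (std::size_t j = 0; j + 1 < m; ++j)        // Horspool table
        d[static_cast<unsigned char>(pat[j])] = m - 1 - j;
    const std::size_t md2 = d[last];               // real shift of the last character
    d[last] = 0;                                   // 0 stops the skip loop

    // Sentinel: m copies of the last pattern character after the text.
    const std::string buf = text + std::string(m, pat[m - 1]);
    const auto* t = reinterpret_cast<const unsigned char*>(buf.data());

    std::size_t i = m - 1;                         // index of the window's last character
    for (;;) {
        std::size_t k = d[t[i]];
        while (k != 0) {                           // three jumps per loop test
            i += k;
            i += d[t[i]];
            i += d[t[i]];
            k = d[t[i]];
        }
        if (i >= n) break;                         // stopped inside the sentinel
        assert(i + 1 >= m);
        if (std::memcmp(t + i + 1 - m, pat.data(), m - 1) == 0)
            hits.push_back(i + 1 - m);
        i += md2;
    }
    return hits;
}
```

Each jump is taken only when the last character of the window mismatches, so no occurrence is skipped; copying the text is a simplification for this example and would be avoided in a tuned implementation.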

We improved BMHq [3] by using a good-prefix rule to increase the jump distance, using unaligned reads to reduce the number of read operations, and adding the methods of Sect. 2. The resulting algorithm is named BMHqMI. BMH2MI_w4Ic is shown in Algorithm 3.

When \( suff(SCW,q - 1) \notin fac(P) \), BMHq slides the window from win0 to win1, as shown in Fig. 2. If \( suff(SCW,q - 1) \ne pref(P,q - 1) \), win1 cannot match either, so the window should keep sliding until the first k is found that satisfies \( suff(SCW,k) = pref(P,k) \) (the window gets an extra jump to win2).
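One way to read this rule is as a precomputed shift for every q-gram that does not occur in P. The sketch below follows that reading under stated assumptions: the function name `absent_gram_shift` is hypothetical, and k is taken as the largest value below q for which the suffix of the q-gram equals the prefix of P, which gives the smallest safe shift m − k (k = 0 always qualifies, giving a shift of m).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Shift for a q-gram `gram` (the last q characters of the window) that does
// not occur anywhere in `pat`. Plain BMHq would always shift by m - q + 1.
std::size_t absent_gram_shift(const std::string& pat, const std::string& gram) {
    const std::size_t m = pat.size(), q = gram.size();
    assert(q >= 1 && m >= q);
    for (std::size_t k = std::min(q - 1, m); k > 0; --k) {
        // Does the length-k suffix of the q-gram equal the length-k prefix of P?
        if (gram.compare(q - k, k, pat, 0, k) == 0) return m - k;
    }
    return m;  // nothing overlaps: jump past the whole q-gram
}
```

For P = "GCAGAGAG" and q = 3, the absent 3-gram "TGC" gives a shift of 6 (the same as plain BMHq), "TTG" gives 7, and "TTT" gives 8, so the extra jump is at most q − 1 positions.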

Storing the jump distance for q-grams needs a q-dimensional table, in which one lookup needs q reads. An unaligned read can simulate the q-dimensional lookup with a single read and a single table lookup. On a little-endian processor, with \( a = T[i + m - 2] \) and \( b = T[i + m - 1] \), *(uint16_t*)(T + i + m − 2) = a + b * 256. If the 2-dimensional jump distance table is shift, build a 1-dimensional table shift1D such that for \( \forall a,b \in \varSigma \), \( shift1D[a + b*256] = shift[a][b] \). Then \( shift1D[*({\text{uint16\_t}}*)(T + i + m - 2)] \) \( = shift[a][b] \). If the string read is T[i + m − 2 … i + m] for q = 3, *(uint32_t*)(T + i + m − 2) & 0x00ffffffu = T[i + m − 2] + (int)T[i + m − 1] * 256 + (int)T[i + m] * 65536 can be used.
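A small sketch of the flattened table for q = 2 follows. The names `build_shift1d` and `lookup_shift` are hypothetical, the table holds a plain Horspool-style 2-gram shift (without the good-prefix extension) purely for illustration, and the lookup assumes a little-endian machine as the text does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// shift1d[a + b * 256] holds the shift for the 2-gram (a, b) at the end of
// the window: m - 1 if "ab" does not occur in pat[0 .. m-2], otherwise the
// distance from its rightmost such occurrence to the end of the pattern.
std::vector<std::uint32_t> build_shift1d(const std::string& pat) {
    const std::size_t m = pat.size();
    assert(m >= 2);
    std::vector<std::uint32_t> shift1d(65536, static_cast<std::uint32_t>(m - 1));
    for (std::size_t j = 0; j + 2 < m; ++j) {
        const unsigned a = static_cast<unsigned char>(pat[j]);
        const unsigned b = static_cast<unsigned char>(pat[j + 1]);
        shift1d[a + b * 256] = static_cast<std::uint32_t>(m - 2 - j);
    }
    return shift1d;
}

// One unaligned 16-bit read replaces two character reads and a 2-D lookup.
// On a little-endian CPU the value read equals a + b * 256.
inline std::uint32_t lookup_shift(const std::vector<std::uint32_t>& shift1d,
                                  const char* last2) {
    std::uint16_t idx;
    std::memcpy(&idx, last2, sizeof idx);
    return shift1d[idx];
}
```

For P = "GCAGAGAG", looking up the window suffix "AG" returns 2 and "TT" returns the default 7. The table has 65536 entries regardless of σ, which trades memory for a single load per lookup.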

4 Experiment and Results

We ran the following experiment on SMART 13.02 [6], which provides implementations of most known algorithms (published in EI- or SCI-indexed papers) as of Feb. 2013. The platform is an Intel Core2 E3400 @ 3.0 GHz running Ubuntu 12.10 64-bit desktop, with g++ 4.6 and -O3 optimization. Three sample texts [8] were tested: a DNA sequence (E.coli), pure English text (Bible.txt) and a sample of English natural language (world192.txt). The experiment compared all algorithms in SMART 13.02 together with some newer algorithms not included in SMART, such as SBNDMqb [9], GSB2b [9], FSO [10], HGQSkip [11], kSWxC [12], SufOM [13] and Greedy-QF [14]. If an algorithm with different parameters is counted as different algorithms, more than 1000 algorithms were compared, covering most known algorithms. The experimental data (tens of thousands of records) cannot all be listed; this paper lists only the three fastest algorithms under each matching condition. The results are shown in Table 1, in MB/s.

Table 1. Matching speed of the fastest 3 algorithms and their optimal parameters

5 Conclusion

In this paper, three classical suffix matching algorithms, QS, TBM and BMHq, are improved by introducing the multi-window method and integer comparison based on unaligned reads, and three suffix matching algorithms named QSMI, TBMMI and BMHqMI are proposed. The experimental results show that, for short patterns, these algorithms are faster than other known algorithms under many matching conditions.

6 Acknowledgements

This paper is supported by National Natural Science Foundation of Yunnan, China under Grant 2012FB131 and 2012FB137, Key Project of National Natural Science Foundation of Yunnan, China under Grant 2014FA029, and National Natural Science Foundation of China under Grant 61562051.

References

1. Sunday, D.M.: A very fast substring search algorithm. Commun. ACM 33(8), 132–142 (1990)
2. Hume, A., Sunday, D.: Fast string searching. Softw. Pract. Exp. 21(11), 1221–1248 (1991)
3. Kalsi, P., Peltola, H., Tarhio, J.: Comparison of exact string matching algorithms for biological sequences. BIRD 2008, pp. 417–426. Springer, Berlin (2008)
4. Faro, S., Lecroq, T.: A multiple sliding windows approach to speed up string matching algorithms. In: Klasing, R. (ed.) SEA 2012. LNCS, vol. 7276, pp. 172–183. Springer, Heidelberg (2012)
5. Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)
6. SMART: string matching research tools. http://www.dmi.unict.it/~faro/smart/
7. Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13:1–13:42 (2013)
8. The Large Canterbury Corpus. http://corpus.canterbury.ac.nz/descriptions/
9. Peltola, H., Tarhio, J.: Variations of forward-SBNDM. In: PSC2011, pp. 3–14. Czech Technical University, Prague (2011)
10. Fredriksson, K., Grabowski, S.: Practical and optimal string matching. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 376–387. Springer, Heidelberg (2005)
11. Wu, W., Fan, H., Liu, L., Huang, Q.: Fast string matching algorithm based on the skip algorithm. ICM 2012. LNEE, vol. 236, pp. 247–257. Springer, New York (2013)
12. Lv, Z., Fan, H., Liu, L., Huang, Q., et al.: Fast single pattern string matching algorithms based on multi-windows and integer comparison. In: IET International Conference on ICISCE 2012, pp. 1–5 (2012). doi:10.1049/cp.2012.2326
13. Fan, H., Yao, N.: Tuning the EBOM algorithm with suffix jump. ICITSE 2012. LNEE, vol. 211, pp. 965–973 (2013)
14. Chen, Z., Liu, L., Fan, H., Huang, Q., et al.: A fast exact string matching algorithms based on greedy jump and QF. ICISCE 2012. In: IET International Conference (2012). doi:10.1049/cp.2012.2320
15. Fan, H., Yao, N.: Q-gram variation for EBOM. In: Proceedings of the 2012 International Conference on Information Technology and Software Engineering. LNEE, vol. 211, pp. 453–460 (2013)
16. Ďurian, B., Holub, J., Peltola, H., Tarhio, J.: Tuning BNDM with q-grams. In: ALENEX 2009, pp. 29–37. SIAM, New York (2009)

Author information

Authors and Affiliations

Department of Computer Science, Kunming University of Science and Technology, Kunming, 650500, China
Hongbo Fan, Shupeng Shi, Jing Zhang & Li Dong
Computer Technology Application Key Laboratory of Yunnan, Kunming, 650500, Yunnan, China
Hongbo Fan, Shupeng Shi, Jing Zhang & Li Dong

Corresponding author

Correspondence to Jing Zhang.

Editor information

Editors and Affiliations

Institute of Information Engineering, Chinese Academy of Science, Beijing, China
Sihan Qing
Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
Eiji Okamoto
School of Computing, KAIST, Daejeon, Korea (Republic of)
Kwangjo Kim
Westone Corporation, Beijing, China
Dongmei Liu

About this paper

Cite this paper

Fan, H., Shi, S., Zhang, J., Dong, L. (2016). Suffix Type String Matching Algorithms Based on Multi-windows and Integer Comparison. In: Qing, S., Okamoto, E., Kim, K., Liu, D. (eds) Information and Communications Security. ICICS 2015. Lecture Notes in Computer Science, vol 9543. Springer, Cham. https://doi.org/10.1007/978-3-319-29814-6_35

DOI: https://doi.org/10.1007/978-3-319-29814-6_35
Published: 05 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29813-9
Online ISBN: 978-3-319-29814-6
eBook Packages: Computer Science, Computer Science (R0)
