Abstract
In this paper, three classic suffix-type string matching algorithms, QS, Tuned BM and BMHq, are improved by reducing the average cost of their basic operations. First, the multi-window method is used so that the jump-distance calculations of the different windows can run in parallel and be pipelined. Second, the unit of comparison is widened from a character to an integer, which reduces both the total number and the average cost of comparisons. For BMHq in particular, the jump distance is increased by a good-prefix rule, and the operations needed to obtain the jump distance are simplified by unaligned integer reads. The resulting algorithms are named QSMI, TBMMI and BMHqMI. In many cases they are faster than other known algorithms.
1 Introduction
String matching is a performance bottleneck and a key issue in many important fields, so the design of exact single-pattern matching algorithms is of great significance. High-performance matching is in especially strong demand in the fields we focus on, real-time information processing and security.
For a string \( S = s_{0} s_{1} \ldots s_{m - 1} \) and \( 0 < k \le m \), we denote the prefix and the suffix of S of length k by \( pref(S,k) \) and \( suff(S,k) \), and a factor of S of length k by \( fac(S,k) \). SCW denotes the text in the current sliding window. All algorithms in this paper are exact single-pattern matching algorithms: given an alphabet Σ (|Σ| = σ, with Σ* the closure of Σ), a text T = \( t_{0} t_{1} \ldots t_{n - 1} \) of length n and a pattern \( P = p_{0} p_{1} \ldots p_{m - 1} \) of length m, with P, T ∈ Σ*, the task is to find, among all possible sliding windows, every window for which \( P[i] = SCW[i] \) for \( \forall i \in [0 \ldots m - 1] \). Algorithms are described in C/C++.
This paper improves three classical suffix matching algorithms: Quick Search [1], Tuned BM [2] and BMHq [3]. We add the Multi-window method [4] and an integer comparison method to each of them. The resulting three series of algorithms, named QSMI, TBMMI and BMHqMI, are very fast for short patterns.
2 Accelerating Method: Multi-window and Integer Comparison
Multi-window [4] (shown in Fig. 1) divides the text equally into k/2 areas under a k-window mechanism (k is even). Each area has two windows, which match from the two ends of the area toward its middle until they overlap, and the windows carry out their matching steps in turn. It is a general acceleration method for string matching.
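The two-window case (k = 2, a single area) can be sketched in C as follows. This is an illustrative sketch, not the code of [4]: it assumes Quick Search style shifts, a plain memcmp for verification, and the hypothetical names two_window_count, fwd and bwd. The left window moves rightwards using the character just after it, the right window moves leftwards using the character just before it, and the two table lookups in each iteration are independent of each other, which is what allows them to overlap in the pipeline.

```c
#include <string.h>

/* Counts occurrences of pat[0..m-1] in txt[0..n-1] with two windows
   that start at opposite ends of the text and move toward each other. */
static long two_window_count(const unsigned char *txt, long n,
                             const unsigned char *pat, long m)
{
    long fwd[256], bwd[256], count = 0;
    if (m < 1 || n < m) return 0;

    for (int c = 0; c < 256; c++) fwd[c] = bwd[c] = m + 1;
    /* Left window: shift by m - (last index of c in pat). */
    for (long i = 0; i < m; i++) fwd[pat[i]] = m - i;
    /* Right window (mirrored): shift by (first index of c in pat) + 1. */
    for (long i = m - 1; i >= 0; i--) bwd[pat[i]] = i + 1;

    long l = 0, r = n - m;              /* window start positions */
    while (l <= r) {
        if (memcmp(txt + l, pat, (size_t)m) == 0) count++;
        if (l != r && memcmp(txt + r, pat, (size_t)m) == 0) count++;
        /* The two lookups below do not depend on each other. */
        l += (l + m < n) ? fwd[txt[l + m]] : 1;
        r -= (r > 0) ? bwd[txt[r - 1]] : 1;
    }
    return count;
}
```

The loop stops once the windows have crossed, at which point every start position has either been checked by one window or skipped safely by one of the two shift tables.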
Suffix matching involves many comparisons. Let punishment denote the delay caused by a branch misprediction. The average branch cost of a character comparison is about \( 1 - \sigma^{ - 1} + \sigma^{ - 1} *punishment \), e.g., 10.5 ticks for a DNA sequence on Prescott. Instead, \( pref(SCW,w) \) can be read into an integer with one unaligned read and compared with the integer holding \( pref(P,w) \); the remaining comparisons are needed only when the two integers are equal. One integer comparison is equivalent to \( w \) character comparisons, and the average branch cost is reduced to \( 1 - \sigma^{ - w} + \sigma^{ - w} *punishment \). When comparing uint16_t/uint32_t on Prescott with a DNA sequence, the average branch cost drops to 3.27/1.15 ticks.
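A minimal sketch of the integer comparison idea in C, assuming w = 4 (uint32_t) and a pattern of at least four characters. The names read_u32 and window_matches are illustrative and do not come from the paper, and memcpy is used as a portable stand-in for the unaligned pointer read (compilers typically turn it into a single load).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read 4 bytes from an arbitrary, possibly unaligned, address. */
static uint32_t read_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Returns 1 if the window starting at win equals pat[0..m-1], else 0.
   Requires m >= 4. */
static int window_matches(const unsigned char *win,
                          const unsigned char *pat, size_t m)
{
    /* One branch covers the first w = 4 characters. */
    if (read_u32(win) != read_u32(pat))
        return 0;
    /* Only reached when the 4-byte prefixes are equal. */
    return memcmp(win + 4, pat + 4, m - 4) == 0;
}
```

In a real search loop the integer for the pattern prefix would be computed once before the loop rather than on every call.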
3 Improved Algorithms Based on QS, Tuned BM and BMHq
Introducing the above methods into QS [1] gives a new algorithm called QSMI_wkXc, where k is the number of windows and X is the integer type used for comparison: S: short/uint16_t, I: int/uint32_t, L: long long/uint64_t. The code of QSMI_w4Ic is listed as Algorithm 1.
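The sketch below is not the paper's Algorithm 1. It is a simplified single-window illustration of how the Quick Search bad-character shift combines with a uint32_t prefix comparison, assuming m ≥ 4. The names qs_int_count and load_u32 are hypothetical, and the explicit bounds check replaces the sentinel character that tuned implementations usually rely on.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* portable unaligned read */
    return v;
}

/* Counts occurrences of pat[0..m-1] in txt[0..n-1]; requires m >= 4. */
static size_t qs_int_count(const unsigned char *txt, size_t n,
                           const unsigned char *pat, size_t m)
{
    size_t shift[256];
    size_t count = 0, j = 0;
    if (m < 4 || n < m) return 0;

    /* Quick Search table: indexed by the character just after the window. */
    for (size_t c = 0; c < 256; c++) shift[c] = m + 1;
    for (size_t i = 0; i < m; i++) shift[pat[i]] = m - i;

    const uint32_t head = load_u32(pat);   /* pref(P, 4) as one integer */
    while (j + m <= n) {
        if (load_u32(txt + j) == head &&
            memcmp(txt + j + 4, pat + 4, m - 4) == 0)
            count++;
        if (j + m >= n) break;             /* no character after the window */
        j += shift[txt[j + m]];
    }
    return count;
}
```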
Introducing the continuous jump method of Tuned BM into QSMI gives TBMMI. First, integer comparison is used in each window to determine whether a match occurs. Then a Quick Search bad-character jump is taken once, followed by several Horspool bad-character jumps. In TBMMI we use one QS jump followed by two Horspool jumps. TBMMI_w4Ic, which uses only the bad-character jump table of QS, is shown as Algorithm 2.
We improved BMHq [3] by using a good-prefix rule to increase the jump distance, using unaligned reads to reduce the number of read operations, and adding the methods of Sect. 2. The resulting algorithm is named BMHqMI. BMH2MI_w4Ic is shown in Algorithm 3.
When \( suff(SCW,q - 1) \notin fac(P) \), BMHq slides the window from win0 to win1, as shown in Fig. 2. If, in addition, \( suff(SCW,q - 1) \ne pref(P,q - 1) \), win1 cannot match. The window should therefore keep sliding until it reaches the first k that satisfies \( suff(SCW,k) = pref(P,k) \), which gives the window an extra jump to win2.
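One possible reading of this rule for q = 2 is sketched below in C. It is an assumption-laden illustration rather than the paper's Algorithm 3: the table is built Horspool-style from the last 2-gram of the window, and the names build_shift2 and bmh2_count are hypothetical. A 2-gram (a, b) that is not a factor of P yields a jump of m − 1 if b = P[0] (k = 1), and the full jump of m otherwise (k = 0).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills shift[a][b] for every 2-gram (a, b); requires 2 <= m <= 65535. */
static void build_shift2(const unsigned char *pat, size_t m,
                         uint16_t shift[256][256])
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            shift[a][b] = (uint16_t)m;            /* k = 0: no overlap */
    for (int a = 0; a < 256; a++)
        shift[a][pat[0]] = (uint16_t)(m - 1);     /* k = 1: b == P[0] */
    /* 2-grams occurring in P before its last position: rightmost wins. */
    for (size_t i = 0; i + 2 < m; i++)
        shift[pat[i]][pat[i + 1]] = (uint16_t)(m - 2 - i);
}

/* Counts occurrences of pat[0..m-1] in txt[0..n-1]. */
static size_t bmh2_count(const unsigned char *txt, size_t n,
                         const unsigned char *pat, size_t m)
{
    static uint16_t shift[256][256];   /* 128 KiB; static, so not re-entrant */
    size_t count = 0, j = 0;
    if (m < 2 || m > 65535 || n < m) return 0;

    build_shift2(pat, m, shift);
    while (j + m <= n) {
        if (memcmp(txt + j, pat, m) == 0) count++;
        /* Jump on the last 2-gram of the current window. */
        j += shift[txt[j + m - 2]][txt[j + m - 1]];
    }
    return count;
}
```

Every entry is at least 1, so the loop always advances; the entries for non-factor 2-grams are where the extra jump from win1 to win2 shows up.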
Storing the jump distances for q-grams requires a q-dimensional table, in which each lookup needs q reads. An unaligned read can simulate the original q-dimensional lookup with a single read followed by a single table lookup. Let \( a = T[i + m - 2] \) and \( b = T[i + m - 1] \). On a little-endian processor, *(uint16_t*)(T + i + m − 2) = \( a + b*256 \). If the 2-dimensional jump distance table is shift, build a 1-dimensional table shift1D such that \( shift1D[a + b*256] = shift[a][b] \) for \( \forall a,b \in \varSigma \). Then \( shift1D[*({\text{uint16\_t}}*)(T + i + m - 2)] = shift[a][b] \). For q = 3, where the string read is T[i + m − 2 … i + m], *(uint32_t*)(T + i + m − 2) & 0x00ffffffu = T[i + m − 2] + (int)T[i + m − 1] * 256 + (int)T[i + m] * 65536 can be used.
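The flattening can be sketched in C as follows, assuming a little-endian processor and 8-bit table entries. The names flatten_shift, lookup_2gram and gram3_index are illustrative, and memcpy again stands in for the unaligned pointer cast used in the text.

```c
#include <stdint.h>
#include <string.h>

static uint16_t read_u16(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);   /* portable unaligned 2-byte read */
    return v;
}

static uint32_t read_u32le(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* reads 4 bytes, so 4 bytes must be readable */
    return v;
}

/* Flatten the table so that shift1D[a + b*256] == shift[a][b]. */
static void flatten_shift(uint8_t shift[256][256], uint8_t shift1D[65536])
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            shift1D[a + b * 256u] = shift[a][b];
}

/* One read plus one lookup for the 2-gram stored at p (little-endian only). */
static uint8_t lookup_2gram(const uint8_t *shift1D, const unsigned char *p)
{
    return shift1D[read_u16(p)];
}

/* Index of the 3-gram stored at p: p[0] + p[1]*256 + p[2]*65536
   (little-endian only). */
static uint32_t gram3_index(const unsigned char *p)
{
    return read_u32le(p) & 0x00ffffffu;
}
```

On a big-endian machine the two bytes would land in the opposite order, so the table would have to be built as shift1D[a*256 + b] instead.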
4 Experiment and Results
The experiments were based on SMART 13.02 [6], which provides implementations of most known algorithms (published in EI- or SCI-indexed papers) as of Feb. 2013. The platform was an Intel Core2 E3400 @ 3.0 GHz running 64-bit Ubuntu 12.10 desktop, with g++ 4.6 and -O3 optimization. Three sample texts [8] were tested: a DNA sequence (E.coli), pure English text (Bible.txt) and a sample of English natural language (world192.txt). The experiment compared all algorithms in SMART 13.02 together with some newer algorithms not included in SMART, such as SBNDMqb [9], GSB2b [9], FSO [10], HGQSkip [11], kSWxC [12], SufOM [13], Greedy-QF [14], etc. If an algorithm run with different parameters is counted as different algorithms, more than 1000 algorithms were compared, covering most known algorithms. The experimental data (tens of thousands of records) cannot all be listed, so this paper lists only the highest performance of the three algorithms under each matching condition. The results are shown in Table 1, in MB/s.
5 Conclusion
In this paper, three classical suffix matching algorithms, QS, TBM and BMHq, are improved by introducing the Multi-window method and integer comparison based on unaligned reads, yielding three suffix matching algorithms named QSMI, TBMMI and BMHqMI. The experimental results show that, for short patterns, these algorithms are faster than other known algorithms under many matching conditions.
6 Acknowledgements
This paper is supported by National Natural Science Foundation of Yunnan, China under Grant 2012FB131 and 2012FB137, Key Project of National Natural Science Foundation of Yunnan, China under Grant 2014FA029, and National Natural Science Foundation of China under Grant 61562051.
References
1. Daniel, M.S.: A very fast substring search algorithm. Commun. ACM 33(8), 132–142 (1990)
2. Andrew, H.: Fast string searching. Softw. Pract. Exp. 21(11), 1221–1248 (1991)
3. Kalsi, P., Hannu, P., Jorma, T.: Comparison of exact string matching algorithms for biological sequences. BIRD 2008, pp. 417–426. Springer, Berlin (2008)
4. Faro, S., Lecroq, T.: A multiple sliding windows approach to speed up string matching algorithms. In: Klasing, R. (ed.) SEA 2012. LNCS, vol. 7276, pp. 172–183. Springer, Heidelberg (2012)
5. Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)
6. SMART: string matching research tools. http://www.dmi.unict.it/~faro/smart/
7. Simone, F., Thierry, L.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13:1–13:42 (2013)
8. The large canterbury corpus. http://corpus.canterbury.ac.nz/descriptions/
9. Hannu, P., Jorma, T.: Variations of forward-SBNDM. In: PSC2011, pp. 3–14. Czech Technical University, Prague (2011)
10. Fredriksson, K., Grabowski, S.: Practical and optimal string matching. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 376–387. Springer, Heidelberg (2005)
11. Wu, W., Fan, H., Liu, L., Huang, Q.: Fast string matching algorithm based on the skip algorithm. ICM 2012. LNEE, vol. 236, pp. 247–257. Springer, New York (2013)
12. Lv, Z., Fan, H., Liu, L., Huang, Q., et al.: Fast single pattern string matching algorithms based on multi-windows and integer comparison. In: IET International Conference on ICISCE 2012, pp. 1–5 (2012). doi:10.1049/cp.2012.2326
13. Fan, H., Yao, N.: Tuning the EBOM algorithm with suffix jump. ICITSE 2012. LNEE, vol. 211, pp. 965–973 (2013)
14. Chen, Z., Liu, L., Fan, H., Huang, Q., et al.: A fast exact string matching algorithms based on greedy jump and QF. ICISCE 2012. In: IET International Conference (2012). doi:10.1049/cp.2012.2320
15. Fan, H., Yao, N.: Q-gram variation for EBOM. In: Proceedings of the 2012 International Conference on Information Technology and Software Engineering. LNEE, vol. 211, pp. 453–460 (2013)
16. Branislav, D., Jan, H., Hannu, P., Jorma, T.: Tuning BNDM with q-grams. In: ALENEX 2009, pp. 29–37. SIAM, New York (2009)
© 2016 Springer International Publishing Switzerland
Cite this paper
Fan, H., Shi, S., Zhang, J., Dong, L. (2016). Suffix Type String Matching Algorithms Based on Multi-windows and Integer Comparison. In: Qing, S., Okamoto, E., Kim, K., Liu, D. (eds) Information and Communications Security. ICICS 2015. Lecture Notes in Computer Science(), vol 9543. Springer, Cham. https://doi.org/10.1007/978-3-319-29814-6_35
DOI: https://doi.org/10.1007/978-3-319-29814-6_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29813-9
Online ISBN: 978-3-319-29814-6