Abstract
Implementations of relational operators on GPU processors have resulted in order-of-magnitude speedups compared to their multicore CPU counterparts. Here we focus on the efficient implementation of string matching operators common in SQL queries. Due to different architectural features, the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so in a naive implementation it is not feasible to keep the working set of all threads in the cache. In GPUs the unit of execution is a group of threads, and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache memory efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in the GPU memory. We evaluate the memory efficiency in terms of memory access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in terms of raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
Notes
When the L1 cache is turned on, 128-byte cache lines are sent from the L2. When L1 caching is off, 32 bytes are accessed at a time from the L2.
The higher frequency is a “boost” frequency that can only be used when there is power and temperature headroom.
This is the maximum base frequency, not a boosted frequency that depends on power or temperature headroom.
Prices were taken on 05/08/2015 from amazon.com.
Acknowledgments
This material is based upon work supported by National Science Foundation Grant IIS-1218222, an IBM Ph.D. Fellowship, an Onassis Foundation Scholarship, and by an equipment gift from Nvidia Corporation. We would like to thank MSc student Le Chang for the initial implementation of the code.
Appendices
Appendix 1
Here, we show two additional examples of the segmentation and pivoting methods. The pattern being searched is ‘ATG’ and the input strings are 16 characters long.
Figure 20 shows the execution of the Pivot-4 KMP-Hybrid method. Different threads process different strings. Threads start by scanning the first character of the first pivoted piece. In the inner loop of KMP-Hybrid, the input is advanced either if the comparison of the current pattern character with the input succeeds or if it fails on the first character of the pattern (\(j==0\)). In the first iteration, T1 and T3 match their input against the pattern. In the second iteration, the comparison fails for both T1 and T3, so they have to consult the nxt table to shift the pattern. In the third iteration, T1 and T3 compare the second input character again after shifting the pattern. Threads synchronize again in the seventh iteration, after all threads have processed the first piece and scan the first character of the second pivoted piece.
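The loop described above can be sketched in Python to make the iteration counts concrete. This is a minimal sequential sketch, not the authors' GPU kernel: `build_nxt`, `scan_piece`, and the four sample pieces are hypothetical, and `nxt` is assumed to be the standard KMP failure table. Each call plays the role of one thread scanning its first four-character piece, and the slowest thread determines when the group can move on together.

```python
def build_nxt(pattern):
    """Assumed failure table: nxt[j] is the length of the longest proper
    prefix of pattern[:j] that is also a suffix of it."""
    nxt = [0] * (len(pattern) + 1)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = nxt[k]
        if pattern[j] == pattern[k]:
            k += 1
        nxt[j + 1] = k
    return nxt


def scan_piece(piece, pattern, nxt, j=0):
    """Scan one piece starting in pattern state j.
    Returns (found, final_j, iterations)."""
    i = iterations = 0
    while i < len(piece):
        iterations += 1
        if piece[i] == pattern[j]:
            # Comparison succeeds: advance both input and pattern.
            i += 1
            j += 1
            if j == len(pattern):
                return True, j, iterations
        elif j == 0:
            # Failure on the first pattern character: advance the input.
            i += 1
        else:
            # Failure elsewhere: shift the pattern, re-read the same input.
            j = nxt[j]
    return False, j, iterations


pattern = "ATG"
nxt = build_nxt(pattern)
# Hypothetical first pieces held by four threads (not the data of Figure 20).
pieces = ["AACC", "CCCC", "AATG", "GGGG"]
results = [scan_piece(p, pattern, nxt) for p in pieces]
iterations = [r[2] for r in results]
# Threads in a group run in lockstep, so the slowest one sets the cost.
lockstep_cost = max(iterations)
print(iterations, lockstep_cost)
```

With these sample pieces the first thread needs six iterations for its piece while the others need four or five, so the whole group would start the second piece together in the seventh iteration.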
Figure 21 shows the execution of the Seg-4-4 method. The segment size is four characters, four threads concurrently process each input string, and two warps in total process the four input strings. Each thread except the last in each group processes six characters in total (its own four-character segment plus two characters of the next segment, one fewer than the pattern length) in order to locate pattern occurrences that span different segments.
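The overlap between neighbouring segments can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the paper's implementation: `segment_ranges` and `segmented_contains` are hypothetical helpers, the sample string is made up, and Python's built-in substring test stands in for the per-thread matching algorithm.

```python
def segment_ranges(text_len, seg_size, pattern_len):
    """Return one (start, end) range per thread. Every range except the last
    is extended by pattern_len - 1 characters into the next segment."""
    overlap = pattern_len - 1
    ranges = []
    for start in range(0, text_len, seg_size):
        end = min(start + seg_size + overlap, text_len)
        ranges.append((start, end))
    return ranges


def segmented_contains(text, pattern, seg_size):
    """True if any thread's range contains the pattern.
    The `in` operator is a stand-in for the per-thread matcher."""
    ranges = segment_ranges(len(text), seg_size, len(pattern))
    return any(pattern in text[s:e] for s, e in ranges)


pattern = "ATG"
# Hypothetical 16-character input; 'ATG' straddles the first two segments.
text = "CCC" + "ATG" + "C" * 10
ranges = segment_ranges(len(text), 4, len(pattern))
lengths = [e - s for s, e in ranges]
found_with_overlap = segmented_contains(text, pattern, 4)
# Without the overlap, the occurrence across the boundary is missed.
found_without_overlap = any(
    pattern in text[s:s + 4] for s in range(0, len(text), 4)
)
print(ranges, lengths, found_with_overlap, found_without_overlap)
```

For a 16-character string, four-character segments, and a three-character pattern, the first three threads each read six characters and the last reads four, which agrees with the counts given for Seg-4-4.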
Appendix 2
Figure 22 shows the long-term worst-case memory access behavior of KMP-Hybrid for a string of 512 characters. The behavior stabilizes after 128 characters (the length of the search pattern used in this worst-case example).
Appendix 3
Table 11 shows the results for the following subquery of TPC-H Q13.
The GPU achieves more than 3x better query performance and consumes 2.42x less energy.
Sitaridi, E.A., Ross, K.A. GPU-accelerated string matching for database applications. The VLDB Journal 25, 719–740 (2016). https://doi.org/10.1007/s00778-015-0409-y
DOI: https://doi.org/10.1007/s00778-015-0409-y