GPU-accelerated string matching for database applications

  • Special Issue Paper
The VLDB Journal

Abstract

Implementations of relational operators on GPU processors have resulted in order-of-magnitude speedups compared to their multicore CPU counterparts. Here we focus on the efficient implementation of string matching operators common in SQL queries. Because of architectural differences, the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so in a naive implementation it is not feasible to keep the working set of all threads in the cache. On GPUs the unit of execution is a group of threads, and in the presence of loops and branches the threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache memory efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in the GPU memory. We evaluate the memory efficiency in terms of memory access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in terms of raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.

Notes

  1. To the best of our knowledge, string pivoting has been suggested only for fixed 1-byte or 4-byte units [25, 49, 65], without studying the impact on cache behavior.

  2. When the L1 cache is turned on, 128 byte cache lines are sent from the L2. When L1 caching is off, 32 bytes are accessed at a time from the L2.

  3. The higher frequency is a “boost” frequency that can only be used when there is power and temperature headroom.

  4. This is the maximum base frequency, not a boosted frequency that depends on power or temperature headroom.

  5. Prices were taken on 05/08/2015 from amazon.com.

References

  1. Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)

  2. Apostolico, A., Giancarlo, R.: The Boyer Moore Galil string searching strategies revisited. SIAM J. Comput. 15(1), 98–105 (1986). doi:10.1137/0215007

  3. Bakkum, P., Chakradhar, S.: Efficient Data Management for GPU Databases. http://hgpu.org/?p=7180 (2012)

  4. Bakkum, P., Skadron, K.: Accelerating SQL database operations on a GPU with CUDA. In: GPGPU (2010). doi:10.1145/1735688.1735706

  5. Bellekens, X., Andonovic, I., Atkinson, R., Renfrew, C., Kirkham, T.: Investigation of GPU-based pattern matching. In: The 14th Annual Post Graduate Symposium on the Convergence of Telecommunications, Networking and Broadcasting (PGNet2013) (2013)

  6. Bhargava, A., Kondrak, G.: Multiple word alignment with profile hidden Markov models. In: ACL, Companion Volume: Student Research Workshop and Doctoral Consortium, Association for Computational Linguistics, Boulder, Colorado, pp. 43–48. http://www.aclweb.org/anthology/N/N09/N09-3008 (2009)

  7. Boost Library. http://www.boost.org/ (2014)

  8. Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10) (1977). doi:10.1145/359842.359859

  9. Breß, S., Heimel, M., Siegmund, N., Bellatreche, L., Saake, G.: GPU-accelerated database systems: survey and open challenges. T Large Scale Data Knowl. Cent. Syst. 15, 1–35 (2014). doi:10.1007/978-3-662-45761-0_1

  10. Carrillo, S., Siegel, J., Li, X.: A control-structure splitting optimization for GPGPU. In: CF ’09, pp. 147–150 (2009). doi:10.1145/1531743.1531766

  11. Cascarano, N., Rolando, P., Risso, F., Sisto, R.: iNFAnt: NFA pattern matching on GPGPU devices. SIGCOMM Comput. Commun. Rev. 40(5), 20–26 (2010). doi:10.1145/1880153.1880157

  12. Crochemore, M., Lecroq, T.: Pattern-matching and text-compression algorithms. ACM Comput. Surv. 28(1), 39–41 (1996). doi:10.1145/234313.234331

  13. Dbpedia. http://wiki.dbpedia.org/Downloads2014 (2014)

  14. Design and Analysis of Algorithms Lecture Notes. http://www.ics.uci.edu/~eppstein/161/960227.html (1996)

  15. Diamos, G., Ashbaugh, B., Maiyuran, S., Kerr, A., Wu, H., Yalamanchili, S.: SIMD re-convergence at thread frontiers. In: MICRO (2011). doi:10.1145/2155620.2155676

  16. Fang, R., He, B., Lu, M., Yang, K., Govindaraju, N.K., Luo, Q., Sander, P.V.: GPUQP: query co-processing using graphics processors. In: SIGMOD, pp. 1061–1063 (2007)

  17. Farivar, R., Kharbanda, H., Venkataraman, S., Campbell, R.: An algorithm for fast edit distance computation on GPUs. In: Innovative Parallel Computing (InPar), pp. 1–9 (2012). doi:10.1109/InPar.6339593

  18. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005). doi:10.1145/1082036.1082039

  19. Fisk, M., Varghese, G.: Applying fast string matching to intrusion detection. Tech. rep., http://woozle.org/~mfisk/papers/setmatch-raid (2004)

  20. Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.M.: Dynamic warp formation and scheduling for efficient GPU control flow. In: MICRO (2007). doi:10.1109/MICRO.2007.12

  21. Han, T.D., Abdelrahman, T.S.: Reducing branch divergence in GPU programs. In: GPGPU, pp. 3:1–3:8 (2011). doi:10.1145/1964179.1964184

  22. Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980). doi:10.1002/spe.4380100608

  23. Hummel, M.: Parstream—A Parallel Database on GPUs. http://www.nvidia.com/content/gtc-2010/pdfs/4004a_gtc2010 (2010)

  24. Intel 64 and IA-32 Architectures Software Developer’s Manual. http://download.intel.com/design/processor/manuals/253665 (2011)

  25. Iorio, F., van Lunteren, J.: Fast pattern matching on the cell broadband engine, workshop on cell systems and applications. In: The 35th International Symposium on Computer Architecture (ISCA), Beijing, China (2008)

  26. Jacob, N., Brodley, C.: Offloading IDS computation to the GPU. In: ACSAC, pp. 371–380 (2006). doi:10.1109/ACSAC.2006.35

  27. Kaldewey, T., Lohman, G.M., Mueller, R., Volk, P.B.: GPU join processing revisited. In: DaMoN (2012)

  28. Karkkainen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.Y., Wong, C. (eds.) Computing and Combinatorics, LNCS, vol. 1090, pp. 219–230 (1996). doi:10.1007/3-540-61332-3_155

  29. Knuth, D.E., Morris Jr, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)

  30. Kouzinopoulos, C., Margaritis, K.: String matching on a multicore GPU using CUDA. In: PCI, pp. 14–18 (2009). doi:10.1109/PCI.2009.47

  31. Li, J., Chen, S., Li, Y.: The fast evaluation of hidden Markov models on GPU. In: IEEE International Conference on Intelligent Computing and Intelligent Systems, 2009 (ICIS 2009), vol. 4, pp. 426–430 (2009)

  32. Ligowski, L., Rudnicki, W.: An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. In: IEEE International Symposium on Parallel Distributed Processing, 2009 (IPDPS 2009), pp. 1–8 (2009). doi:10.1109/IPDPS.2009.5160931

  33. Lin, K.J., Huang, Y.H., Lin, C.Y.: Efficient parallel knuth-morris-pratt algorithm for multi-GPUs with CUDA. In: Pan, J.S., Yang, C.N., Lin, C.C. (eds.) Advances in Intelligent Systems and Applications, vol. 21, pp. 543–552 (2013). doi:10.1007/978-3-642-35473-1_54

  34. Lin, C.H., Tsai, S.Y., Liu, C.H., Chang, S.C., Shyu, J.M.: Accelerating string matching using multi-threaded algorithm on GPU. In: GLOBECOM, pp. 1–5 (2010). doi:10.1109/GLOCOM.2010.5683320

  35. Lin, C.H., Liu, C.H., Chien, L.S., Chang, S.C.: Accelerating pattern matching using a novel parallel algorithm on GPUs. IEEE Trans. Comput. 62(10), 1906–1916 (2013). doi:10.1109/TC.2012.254

  36. Liu, Y., Maskell, D., Schmidt, B.: CUDASW++: optimizing Smith–Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res. Notes 2(1), 73 (2009). doi:10.1186/1756-0500-2-73

  37. Marziale III, L., Richard, G.G., Roussev, V.: Massive threading: using GPUs to increase the performance of digital forensics tools. Digit. Investig. 4, 73–81 (2007). doi:10.1016/j.diin.2007.06.014

  38. Meng, J., Tarjan, D., Skadron, K.: Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News 38(3), 235–246 (2010). doi:10.1145/1816038.1815992

  39. Mostak, T., Graham, T.: Map-D Data Redefined. http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-express-map-d-webinar (2014)

  40. Narasiman, V., Shebanow, M., Lee, C.J., Miftakhutdinov, R., Mutlu, O., Patt, Y.N.: Improving GPU performance via large warps and two-level warp scheduling. In: MICRO, pp. 308–317 (2011). doi:10.1145/2155620.2155656

  41. Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001). doi:10.1145/375360.375365

  42. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970). doi:10.1016/0022-2836(70)90057-4

  43. Netzer, O.: Getting Big Data Done on a GPU-Based Database. http://on-demand.gputechconf.com/gtc/2014/presentations/S4644-big-data-gpu-based-database (2014)

  44. Pirk, H., Manegold, S., Kersten, M.: Waste not...; efficient co-processing of relational data. In: 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 508–519 (2014). doi:10.1109/ICDE.2014.6816677

  45. Pyrgiotis, T., Kouzinopoulos, C., Margaritis, K.: Parallel implementation of the Wu–Manber algorithm using the OpenCL framework. Artif. Intell. Appl. Innov. 382, 576–583 (2012). doi:10.1007/978-3-642-33412-2_59

  46. Rauhe, H., Dees, J., Sattler, K.U., Faerber, F.: Multi-level parallel query execution framework for CPU and GPU. In: Catania, B., Guerrini, G., Pokorny, J. (eds.) Advances in Databases and Information Systems, Lecture Notes in Computer Science, vol. 8133, pp. 330–343. Springer, Berlin (2013). doi:10.1007/978-3-642-40683-6_25

  47. Re2 Regular Expression Library. http://code.google.com/p/re2/ (2014)

  48. Sartori, J., Kumar, R.: Branch and data herding: reducing control and memory divergence for error-tolerant GPU applications. TMM 15(2), 279–290 (2013). doi:10.1109/TMM.2012.2232647

  49. Scarpazza, D.P., Villa, O., Petrini, F.: Peak-performance DFA-based string matching on the Cell processor. In: IEEE International on Parallel and Distributed Processing Symposium, 2007 (IPDPS 2007). IEEE, pp. 1–8 (2007)

  50. Sitaridi, E.A., Ross, K.A.: Optimizing select conditions on GPUs. In: Proceedings of the Ninth International Workshop on Data Management on New Hardware (DaMoN’13). ACM, New York, NY, USA, pp. 4:1–4:8 (2013). doi:10.1145/2485278.2485282

  51. Smith, T., Waterman, M.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981). doi:10.1016/0022-2836(81)90087-5

  52. Sunday, D.M.: A very fast substring search algorithm. Commun. ACM 33(8), 132–142 (1990). doi:10.1145/79173.79184

  53. Taylor, R., Li, X.: Software-based branch predication for AMD GPUs. SIGARCH Comput. Archit. News 38(4), 66–72 (2011). doi:10.1145/1926367.1926379

  54. Tesla K80 GPU Accelerator. http://images.nvidia.com/content/pdf/kepler/Tesla-K80-BoardSpec-07317-001-v05 (2015)

  55. Tian, Y., Tata, S., Hankins, R.A., Patel, J.M.: Practical methods for constructing suffix trees. VLDB J. 14(3), 281–299 (2005). doi:10.1007/s00778-005-0154-8

  56. TPC-H Benchmark. http://www.tpc.org/tpch/ (2014)

  57. Using Regular Expressions in Oracle Database. http://docs.oracle.com/cd/B19306_01/appdev.102/b14251/adfns_regexp.htm (2014)

  58. Vasiliadis, G., Polychronakis, M., Ioannidis, S.: Parallelization and characterization of pattern matching using GPUs. In: IISWC, pp. 216–225 (2011). doi:10.1109/IISWC.2011.6114181

  59. Weiner, P.: Linear pattern matching algorithms. In: Swat, IEEE Computer Society, pp. 1–11 (1973). doi:10.1109/SWAT.1973.13

  60. Wu, H., Diamos, G., Sheard, T., Aref, M., Baxter, S., Garland, M., Yalamanchili, S.: Red Fox: an execution environment for relational query processing on GPUs. In: International Symposium on Code Generation and Optimization (CGO) (2014)

  61. Yersinia Pestis Chromosome. ftp://ftp.sanger.ac.uk/pub/project/pathogens/yp/Yp.dna (2001)

  62. Zhang, E.Z., Jiang, Y., Guo, Z., Shen, X.: Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping. In: ICS (2010). doi:10.1145/1810085.1810104

  63. Zhang, E.Z., Jiang, Y., Guo, Z., Tian, K., Shen, X.: On-the-fly elimination of dynamic irregularities for GPU computing. In: ASPLOS (2011). doi:10.1145/1950365.1950408

  64. Zha, X., Sahni, S.: GPU-to-GPU and host-to-host multipattern string matching on a GPU. IEEE Trans. Comput. 62(6), 1156–1169 (2013). doi:10.1109/TC.2012.61

  65. Zu, Y., Yang, M., Xu, Z., Wang, L., Tian, X., Peng, K., Dong, Q.: GPU-based NFA implementation for memory efficient high speed regular expression matching. In: PPoPP (2012). doi:10.1145/2145816.2145833

  66. Zukowski, M.: Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, Universiteit van Amsterdam (2009)

Acknowledgments

This material is based upon work supported by National Science Foundation Grant IIS-1218222, an IBM Ph.D. Fellowship, an Onassis Foundation Scholarship, and by an equipment gift from Nvidia Corporation. We would like to thank MSc student Le Chang for the initial implementation of the code.

Author information

Corresponding author

Correspondence to Evangelia A. Sitaridi.

Appendices

Appendix 1

Here, we show two additional examples of the segmentation and pivoting methods. The pattern being searched is ‘ATG’ and the input strings are 16 characters long.

Fig. 20 Execution of the Pivot-4 method for the ‘ATG’ pattern using the KMP-Hybrid string matching method and 4 GPU threads. For each iteration, the numbers in the top line show the string indexes and the numbers in the bottom line show the memory addresses. Some strings contain partial matches, so in the third iteration some threads scan different indexes of the first pivoted piece. Specifically, T1 and T3 fall behind after the second iteration because they have to shift the pattern. When all threads finish processing the first pivoted piece (iteration 6), the threads synchronize and scan the first character of the second pivoted piece.

Fig. 21 Execution of the Seg-4-4 method for the ‘ATG’ pattern. For simplicity, each warp has 8 threads. The input strings are 16 characters long, and four threads process each input string. The search pattern ‘ATG’ is present in the last string, which is processed by the last four threads of W1. Each thread processes six characters to cover the cost of searching across the boundaries between segments. After the third iteration, threads T5-8 of the second warp (W1) are inactive because T7 of the second warp has located the pattern.

Fig. 22 Worst-case memory access behavior of KMP_Hybrid for a string of 512 characters

Figure 20 shows the execution of the Pivot-4 KMP-Hybrid method. Different threads process different strings. All threads start by scanning the first character of the first pivoted piece. In the inner loop of KMP-Hybrid, the input is advanced either if the comparison of the current pattern character to the input succeeds or if it fails on the first character of the pattern (\(j==0\)). In the first iteration, the inputs of T1 and T3 match the pattern. In the second iteration, the comparison fails for both T1 and T3, so they have to consult the nxt table to shift the pattern. In the third iteration, T1 and T3 compare the second input character again after shifting the pattern. The threads synchronize again in the seventh iteration, after all threads have processed the first piece, and scan the first character of the second pivoted piece.
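The loop structure described for a single thread can be sketched in Python as follows. This is a minimal, sequential illustration of a standard Knuth–Morris–Pratt scan, not the authors' GPU kernel. The function names `build_nxt` and `kmp_find` are hypothetical, and the definition of the nxt table used here (the classic prefix function) is an assumption, since the exact table layout is not given in this appendix.

```python
def build_nxt(pattern):
    """Assumed definition: nxt[j] is the length of the longest proper
    prefix of pattern[:j + 1] that is also a suffix of it."""
    nxt = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = nxt[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        nxt[j] = k
    return nxt


def kmp_find(text, pattern):
    """Return (index of the first occurrence or -1, loop iterations)."""
    if not pattern:
        return 0, 0
    nxt = build_nxt(pattern)
    i = j = 0          # i indexes the input, j indexes the pattern
    iterations = 0
    while i < len(text):
        iterations += 1
        if text[i] == pattern[j]:
            # Comparison succeeds: advance both input and pattern.
            i += 1
            j += 1
            if j == len(pattern):
                return i - j, iterations
        elif j == 0:
            # Mismatch on the first pattern character: advance the input.
            i += 1
        else:
            # Mismatch after a partial match: shift the pattern via nxt
            # and re-compare the same input character next iteration.
            j = nxt[j - 1]
    return -1, iterations


print(kmp_find("AATG", "ATG"))
```

For the input `AATG`, the scan matches in the first iteration, fails in the second, and spends the third re-comparing the second input character after the shift. A thread scanning an input without partial matches would already be at the third character by then, which is the kind of lag the figure attributes to T1 and T3.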

Figure 21 shows the execution of the Seg-4-4 method. The segment size is four characters, four threads concurrently process each input string, and two warps in total process the four input strings. Each thread except the last in its group processes six characters in total, so that it can locate pattern occurrences that span segment boundaries.
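The boundary handling can be sketched in Python under one stated assumption: each segment is extended by the pattern length minus one characters, which gives 4 + 3 - 1 = 6 characters for the ‘ATG’ example. The helper names `segment_ranges` and `seg_find` are hypothetical, the loop runs sequentially where the GPU would use one thread per segment, and Python's built-in `str.find` stands in for the per-thread matcher.

```python
def segment_ranges(length, seg_size, pat_len):
    """Return the (start, end) range scanned for each segment.

    Assumes a non-empty pattern. Every segment is extended by
    pat_len - 1 characters and clipped to the string length, so the
    last segment has no characters beyond the end of the string.
    """
    overlap = pat_len - 1
    return [(start, min(start + seg_size + overlap, length))
            for start in range(0, length, seg_size)]


def seg_find(text, pattern, seg_size):
    """Return the smallest match index over all segments, or -1."""
    hits = []
    for start, end in segment_ranges(len(text), seg_size, len(pattern)):
        pos = text[start:end].find(pattern)  # stand-in for one thread
        if pos != -1:
            hits.append(start + pos)
    return min(hits) if hits else -1


# A 16-character string split into four segments of four characters.
print(segment_ranges(16, 4, 3))
# 'ATG' at indexes 3..5 crosses the boundary between segments 0 and 1.
print(seg_find("CCCATGCCCCCCCCCC", "ATG", 4))
```

Without the overlap, the occurrence at indexes 3 to 5 would be split between the first two segments and neither scan would report it.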

Appendix 2

Figure 22 shows the long-term worst-case memory access behavior of KMP_Hybrid for a string of 512 characters. The behavior stabilizes after 128 characters (the length of the search pattern used in this worst-case example).

Appendix 3

Table 11 Summary of the CPU versus GPU comparison for Q13_1

Table 11 shows the results for the following subquery of TPC-H Q13.

figure j

The GPU achieves more than 3x better query performance and consumes 2.42x less energy.
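The two ratios can be related by a short calculation. The sketch below assumes that energy is average power multiplied by execution time, which may not match how the measurements in Table 11 were taken. The symbols \(s\) (speedup), \(t\), \(P\) and \(E\) are introduced here for illustration only.

\[
\frac{E_{\mathrm{CPU}}}{E_{\mathrm{GPU}}}
= \frac{P_{\mathrm{CPU}}\, t_{\mathrm{CPU}}}{P_{\mathrm{GPU}}\, t_{\mathrm{GPU}}}
= \frac{P_{\mathrm{CPU}}}{P_{\mathrm{GPU}}}\, s = 2.42
\quad\Longrightarrow\quad
\frac{P_{\mathrm{GPU}}}{P_{\mathrm{CPU}}} = \frac{s}{2.42} > \frac{3}{2.42} \approx 1.24
\]

Under this assumption, the GPU would draw at least about 1.24 times the average power of the CPU during the query, and its energy advantage would come entirely from the shorter execution time.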

Cite this article

Sitaridi, E.A., Ross, K.A. GPU-accelerated string matching for database applications. The VLDB Journal 25, 719–740 (2016). https://doi.org/10.1007/s00778-015-0409-y

  • DOI: https://doi.org/10.1007/s00778-015-0409-y
