The VLDB Journal, Volume 25, Issue 5, pp 719–740

GPU-accelerated string matching for database applications

  • Evangelia A. Sitaridi
  • Kenneth A. Ross
Special Issue Paper


Implementations of relational operators on GPU processors have resulted in order-of-magnitude speedups compared to their multicore CPU counterparts. Here we focus on the efficient implementation of string matching operators common in SQL queries. Because of architectural differences, the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so in a naive implementation it is not feasible to keep the working set of all threads in the cache. On GPUs the unit of execution is a group of threads, and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache memory efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in GPU memory. We evaluate memory efficiency in terms of memory access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in terms of raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
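To illustrate the regular access pattern attributed to Knuth–Morris–Pratt above, the following is a minimal sequential sketch in Python. It is not the authors' GPU implementation; the function names `build_failure` and `kmp_contains` are hypothetical and chosen only for this illustration. The point to notice is that the text is read strictly left to right, one character per iteration, and only the pattern index moves backward on a mismatch.

```python
def build_failure(pattern):
    """Failure table: fail[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(text, pattern):
    """Return True if pattern occurs in text (substring match, similar in
    spirit to SQL LIKE '%pattern%' with no wildcards inside pattern)."""
    if not pattern:
        return True
    fail = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for ch in text:  # the text position only ever advances
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # only the pattern index moves backward
        if ch == pattern[k]:
            k += 1
            if k == len(pattern):
                return True
    return False
```

Because each input character is consumed exactly once and in order, every thread scanning its own string advances through memory in the same way, regardless of where mismatches occur. Algorithms that skip ahead by data-dependent amounts do not have this property.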


Text queries, String matching, GPU Processing, Thread divergence, Cache efficiency



This material is based upon work supported by National Science Foundation Grant IIS-1218222, an IBM Ph.D. Fellowship, an Onassis Foundation Scholarship, and by an equipment gift from Nvidia Corporation. We would like to thank MSc student Le Chang for the initial implementation of the code.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Department of Computer Science, Columbia University, New York, USA
