Abstract
String searching in documents has become an increasingly tedious task with the growth of Big Data. The generation of large data sets demands high-performance search algorithms in areas such as text mining and information retrieval. GPUs have become increasingly popular for general-purpose computing, so it is of great interest to exploit the massive threading of a GPU to provide a high-performance search algorithm. This paper proposes an optimized approach to the N-gram model for string search across a number of lengthy documents, together with its GPU implementation. The algorithm uses GPGPUs to search for strings in many documents by character-level N-gram matching, with a parallel Score Table approach and a search implemented with the CUDA API. The Score Table stores the frequency of each N-gram in a document, which makes the search independent of the document's length and allows faster access to the frequency values, thus decreasing the search complexity. The GPU's many threads are used to pre-process the trigrams of a document in parallel for Score Table creation and to search a large number of documents in parallel, speeding up the whole search process even for large patterns. Experiments were carried out on many documents of varied length, with search strings taken from the standard Lorem Ipsum text, on an NVIDIA GeForce GT 540M GPU with 96 cores. The results show that the parallel approach to Score Table creation and searching achieves a good speedup over the same approach executed serially.
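The Score Table idea described above can be illustrated with a minimal serial sketch in Python. This is not the authors' CUDA implementation: the function names (`ngrams`, `build_score_table`, `score`, `search`) and the scoring rule (the fraction of the pattern's trigrams found in a document's table) are assumptions made for illustration, since the abstract does not specify how scores are computed. The sketch only shows why a lookup depends on the pattern length rather than the document length once the table has been built.

```python
from collections import Counter

N = 3  # trigram size, as mentioned in the abstract


def ngrams(text, n=N):
    """Return the character-level n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_score_table(document, n=N):
    """Pre-process one document into an n-gram -> frequency table."""
    return Counter(ngrams(document, n))


def score(pattern, table, n=N):
    """Fraction of the pattern's n-grams present in the table.

    Hypothetical scoring rule, chosen for illustration only.
    The cost depends on len(pattern), not on the document length.
    """
    grams = ngrams(pattern, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if table[g] > 0)
    return hits / len(grams)


def search(pattern, tables, n=N):
    """Score the pattern against every document's table.

    Each document is handled independently, which is the property a
    GPU version could exploit by assigning documents to threads.
    """
    return [score(pattern, t, n) for t in tables]


docs = ["lorem ipsum dolor sit amet", "consectetur adipiscing elit"]
tables = [build_score_table(d) for d in docs]
results = search("ipsum dolor", tables)
```

Under this assumed rule, a score of 1.0 means every trigram of the pattern occurs somewhere in the document. That is a necessary condition for an exact match but not a sufficient one, so a real system would likely verify candidate documents afterwards.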
Cite this article
Srinivasa, K.G., Shree Devi, B.N. GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents. J. Inst. Eng. India Ser. B 98, 467–476 (2017). https://doi.org/10.1007/s40031-017-0295-3
DOI: https://doi.org/10.1007/s40031-017-0295-3