Hunting the pertinency of hash and bloom filter combinations on GPU for fast pattern matching

There has been rapid growth in the field of graphics processing unit (GPU) programming due to the drastic increase in computing hardware manufacturing. The technology used in these devices is now more affordable and accessible to the general public. With this growth, many serial programming applications are now being transformed into more efficient parallel programming applications with significant improvement in performance. A good example is the parallel implementation of the probabilistic data structure Bloom filter for set membership queries. However, despite its remarkable performance in speed and memory usage, the Bloom filter incurs a computational overhead in the calculation of hashes. In this paper, the impact of the choice of hash functions on the qualitative properties of the Bloom filter is recorded experimentally, and the results show that there can be a large performance gap among various hash functions. We have implemented the Bloom filter based pattern matching technique on GPU using compute unified device architecture (CUDA) and benchmarked the performance of several cryptographic and non-cryptographic hash functions.


Introduction
The Bloom filter [2] is a fast and memory-efficient probabilistic data structure that can be used to test set membership. A Bloom filter with m bits and k hash functions has a time complexity of O(k) for both insertion and query operations. This complexity is entirely independent of the number of elements already inserted in the set. The Bloom filter is significantly better in terms of space complexity than standard storage data structures because it does not store the actual element; it only sets the bits addressed by the hashes of each element. The only significant computational overhead involved in the functioning of a Bloom filter is hashing. The selection of appropriate hash functions and of an optimal number of hash functions affects the probability of false positives. The selected hash functions must be independent, uniformly distributed, and fast. Simple non-cryptographic hash functions satisfy these criteria and provide a significant performance improvement. In contrast, cryptographic hash functions provide additional stability but are computationally expensive.
However, most research on Bloom filter implementations focuses on developing extensions of the Bloom filter rather than on the hash function used in it. Furthermore, the majority of these implementations have been developed on CPUs. This paper focuses on two different dimensions of the standard Bloom filter to improve its overall performance: choosing the correct hash functions and selecting appropriate hardware. We show that the best-performing Bloom filter configuration cannot be determined intuitively without considering the underlying hash functions. The insertion of an element in a Bloom filter requires k hashes to be computed, which occurs sequentially on CPUs. The calculation of these hash functions can be done in parallel without impacting the Bloom filter's qualitative properties.
We have used a modified version of the One-Hashing algorithm [19] to generate k different hashes using only one hash function, which makes it possible to compare various hashes independently. We have mainly focused on implementing the standard Bloom filter by parallelizing the hash function computation, compared various cryptographic and non-cryptographic hash functions, and documented the impact on different parameters of the Bloom filter. In this work, our goal is to create a standard hash function reference that can be used to choose the appropriate hash and the hardware environment in which the Bloom filter is deployed.

Related work
This section describes the existing schemes on Bloom filter applications in a variety of domains such as bioinformatics, pattern matching, database storage, networking and security.

Bioinformatics
Liu et al. [18] introduced the first use of a hybrid mode of CPU and GPU computations in DNA sequencing error detection using a high throughput short read (HTSR) dataset. Ma et al. [21] have proposed a parallel Bloom filter pattern search method for genome biosequence alignment applications using a standard bit vector Bloom filter. Heo et al. [34] introduced the use of hybrid filters at different phases of DNA sequencing. Refer to Tables 1 and 2 for a summary of the above-mentioned schemes.

Pattern matching
Moraru and Andersen [22] improved the Rabin-Karp pattern search method by introducing two types of Bloom filters: (i) cache-partitioned and (ii) feed-forward. Ong et al. [24] compare the computational improvement of a CUDA implementation versus a multi-core implementation of a string searching algorithm using Bloom filters. Lin et al. [17] introduce a new GPU-assisted string matching method called the parallel failureless Aho-Corasick algorithm (PFAC) and its variants, which exhibit better execution performance than the Aho-Corasick algorithm. Zhang et al. [37] have extended the basic operation of the Bloom filter and exploited the parallel capabilities of the graphics processing unit to obtain improved performance over existing methods. Hayashikawa et al. [12] have proposed a slightly modified form of the classic Bloom filter, called the folded Bloom filter, to overcome the high false positive rate of the blocked Bloom filter [27] for pattern matching applications. Sachendra and Shalini [3] showcase the superior performance of a similarity search using a standard Bloom filter on GPU. Even though the proposed technique claims better running time than the standard Dice coefficient similarity search method [6,29], it has three fundamental problems: (i) the entire input must be pre-stored in GPU global memory, (ii) the input text document must be converted to shingles for feature extraction, and (iii) the input query must be represented as an integer array. Wada et al. [31] have constructed a circuit-level pattern matching Bloom filter using Block RAM and Ultra RAM combinations in restricted memory devices such as field programmable gate arrays (FPGAs). In that work, multiple rolling hash functions [15] are used to restrict the false-positive rate and to obtain fast pattern matching results. Refer to Tables 3 and 4 for a summary of the above-mentioned schemes.

Database storage
Gubner et al. [10] propose a heterogeneous query processing framework based on fluid co-processing, where tasks of different sizes that can fit into memory are dynamically processed together. Tim et al. [11] have demonstrated the use of the Bloom filter in accelerating the join operation in a database to achieve better early pruning in join queries.

Networking and security
Gholami et al. [9] utilize three standard Bloom filters along with GPU programming in order to increase the network packet classification speed and decrease memory consumption. The open-source modular router called Click [23] is utilized to perform packet classification. Dyumin et al. [7] focus on the implementation of a modified counter Bloom filter and showcase the results of statistical analysis with varying parameters such as the number of hash functions, the length of the counter, the filter size, and the number of input elements. Hung et al. [13] propose a solution for accelerating traditional network intrusion detection systems by using Bloom filters through the implementation of a GPU-based multiple pattern matching algorithm for packet filtering. Xiong et al. [32] showcase the implementation of a probabilistic Bloom filter on GPU to analyze the frequency of traffic flow in a network.

Miscellaneous
Dharmapurikar et al. [5] have identified the inability of software-based network monitoring systems to cope with the line speed of the network when detecting special types of patterns (or attacks) in network packets. To overcome this, the authors have proposed a field programmable gate array (FPGA)-based network monitoring system that uses a counting Bloom filter with universal hash functions [28]. Costa et al. [4] primarily focus on offloading the computational load of a classic Bloom filter to a separate co-processor such as a GPU. Zhang et al. [36] focus on minimizing the time required to process the intersection of sorted inverted lists using GPU programming. The algorithm assigns every unique document in a list to a unique GPU thread that uses a parallel binary search algorithm taking O(log(n)) memory accesses. Iacob et al. [14] propose a solution for information retrieval, which is a logical extension of pattern matching using GPUs. Yao et al. [35] proposed the first probabilistic Bloom filter (PBF), which flips the filter bits with some probability. Sisi et al. [33] have proposed the first GPU-assisted probabilistic counting Bloom filter (PBF) over the classic Bloom filter, using the method of probabilistic flipping, for monitoring network traffic for suspicious activities such as denial-of-service (DoS) attacks. To cope with dynamic traffic, Xiong et al. [33] have proposed a modified version of the PBF of [35], with a careful analysis of optimal parameter selection using game theory techniques. Tripathy et al. [30] highlight the improved performance of similarity search in larger concept trees using a GPU co-processor.

Background
This section describes the prerequisites and background knowledge to understand the proposed scheme.

Bloom filter
The standard Bloom filter is a probabilistic data structure that allows for the testing of set membership. False positives are possible, but false negatives are not. It is also an extremely fast and space-efficient data structure. It is implemented as a bit array of m bits, all of which are set to 0 when it is empty. When an element needs to be added to the Bloom filter, it is passed through k different hash functions, and each hash value is reduced modulo m to produce an address corresponding to one of the m array positions; the bits at these k positions are set to 1. To query for an element, it is passed through the same k hash functions to obtain k array positions. If any of the bits at these positions is 0, it can be concluded that the element is definitely not in the set. If all the bits are set to 1, there is a chance that the element is actually in the set. If the element is not in the set, yet all the bits are set to 1, it is a false positive. The basic Bloom filter insertion process with k = 3 is shown in Fig. 1. Eq. (1) can be used to calculate the optimal values for the different parameters of the Bloom filter, where m is the number of bits in the Bloom filter, n is the approximate number of items in the Bloom filter, p is the false positive probability, k is the number of hash functions and ROUND() returns the nearest integer multiple of 10.
Several variations of Bloom filters have been introduced to improve their functionality. Patgiri et al. [26] and Luo et al. [20] provide good surveys of the various types of Bloom filters and the domains in which they are deployed. The counting Bloom filters introduced by Fan et al. [8] allow deletion of elements without resetting the filter, but they use 3-4 times more space. The scalable Bloom filters proposed in Almeida et al. [1] can adapt dynamically to the amount of data stored by implementing a sequence of standard Bloom filters with increasing capacity. The spatial Bloom filters proposed by Palmieri et al. [25] are able to store multiple sets in one data structure and allow for prioritization among these sets, which lets them preserve the important elements.
Fig. 1 The Bloom filter insertion process with k = 3
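The insertion and query operations described in this section can be illustrated with a minimal Python sketch. It is not the implementation benchmarked in this paper. The function `optimal_parameters` uses the widely known textbook sizing relations m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, which may differ from Eq. (1) in rounding; the class name `BloomFilter`, the use of SHA-256, and the index-salting scheme for deriving k addresses are all assumptions made for illustration.

```python
import hashlib
import math


def optimal_parameters(n, p):
    """Textbook sizing (assumption, may differ from Eq. (1) in rounding):
    m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k


class BloomFilter:
    """Illustrative standard Bloom filter with m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item):
        # Assumption: k addresses are derived by salting SHA-256 with the
        # hash index i, then reducing modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


m, k = optimal_parameters(n=1000, p=0.01)
bloom = BloomFilter(m, k)
bloom.insert(b"pattern")
print(m, k, bloom.query(b"pattern"))
```

Because a query only reports absence when it finds a 0 bit, an inserted element can never be reported as absent, which is why false negatives do not occur.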

Evolution of GPU computing and CUDA framework
The CPU contains a few cores that are optimized with a large amount of cache memory and can handle several software threads by processing them in a serial manner. In contrast, a GPU has a massively parallel architecture that contains thousands of small, efficient cores designed to handle multiple tasks simultaneously. Historically, GPUs have been used as special-purpose processing units for very particular use cases such as graphics processing, which involves computing information for millions of pixels. Recently, GPU computing has expanded into the domain of general-purpose computing, making it easier for developers to make use of the additional processing power in their applications.

Cloud-based GPU framework
Cloud providers like Microsoft Azure, Amazon Web Services, Google Colab, and IBM SoftLayer have collaborated with Nvidia to provide on-demand GPU access over the internet. Several challenges exist when providing access to high performance computing over the cloud. Some of the biggest challenges are latency, storage speed, and cloud virtualization. Different high performance applications require varying amounts of compute power: I/O-bottlenecked applications can tolerate a high-latency connection, but more demanding applications require low-latency, high-throughput connections. Overall, the growing cloud infrastructure makes it possible to provide on-demand GPU access, which has the potential to enable many technological innovations without requiring heavy capital investment. Parallel computing is now more affordable and accessible than ever before, and developers can significantly enhance their software performance by exploiting this infrastructure.

CUDA framework
This framework allows developers to write programs for the Nvidia family of GPUs with ease. It provides abstractions over thread groups, shared memories, and barrier synchronization by exposing them to the programmer in a simple manner. The GPU consists of many streaming multiprocessors (SMs), and a multithreaded program is partitioned into blocks of threads that execute independently and are allocated to these SMs. The global memory is the largest memory space on the GPU and also has the highest latency. The data that needs to remain persistent throughout the program execution is stored here. The shared memory is fast and small (a few kilobytes), and is allocated to each multiprocessor, intended to be used as an application cache. The other types of memory are texture, constant, and register memories. The threads running on the GPU cannot access the memory of the host; hence all data required for computation must be transferred from the host main memory to the GPU's global memory. Then, a kernel function is called, which launches a large number of parallel threads to perform the computation, and the results are stored in the global memory. Finally, these results are copied back to the host memory for post-processing. To maximize performance, the following strategies must be used: (i) maximize parallel execution to achieve maximum utilization, (ii) optimize memory usage to achieve maximum memory throughput, (iii) optimize instruction usage to achieve maximum instruction throughput, and (iv) minimize memory thrashing.

Proposed scheme
This section describes the proposed system model and its implementation on the graphics processing unit.
Our implementation consists of four phases, as shown in Fig. 2:
• Phase-1: Random data generation.
Phase-1 and Phase-2 are common to the CPU and GPU implementations. In the first phase, a list of randomized strings between 16 and 128 bytes long is generated, in which each string serves as input data for both the insertion and query operations on the Bloom filter.
In the second phase, the generated random data is concatenated into one long string in which the individual strings are separated by a single space character. This allows the concatenated string to be transferred directly to the GPU global memory and stored in a contiguous array, without the need to process each input string independently. The pseudocodes for Phase-1 and Phase-2 are described in Algorithm 1 and Algorithm 2. We have implemented the first and second phases of our proposed scheme using the Python scripting language. This allows us to measure and document the performance of just the Bloom filter operations, without including the data generation and preprocessing phases.
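As a rough illustration of Phase-1 and Phase-2, the following Python sketch generates random strings of 16 to 128 bytes and joins them with single spaces. It is not a reproduction of Algorithm 1 or Algorithm 2: the function names `generate_strings` and `concatenate`, the alphanumeric alphabet, and the optional `seed` parameter are assumptions. The alphabet deliberately excludes the space character so that the separator stays unambiguous.

```python
import random
import string

# Assumption: an ASCII alphanumeric alphabet, so one character is one byte
# and no generated string can contain the space separator.
ALPHABET = string.ascii_letters + string.digits


def generate_strings(count, min_len=16, max_len=128, seed=None):
    """Phase-1 sketch: random strings with lengths in [min_len, max_len]."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(ALPHABET, k=rng.randint(min_len, max_len)))
        for _ in range(count)
    ]


def concatenate(strings):
    """Phase-2 sketch: one contiguous buffer, strings separated by one space."""
    return " ".join(strings)


words = generate_strings(5, seed=42)
buffer = concatenate(words)
print(len(words), len(buffer))
```

Splitting the buffer on the space character recovers the original list, so nothing is lost by transferring a single contiguous array instead of many small strings.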
Phase-3 and Phase-4 are implemented using the CUDA programming framework, as shown in Fig. 3, on an Nvidia GPU in order to optimize the insertion and query operations of the Bloom filter data structure. Specifically, the large number of streaming multiprocessors available on the GPU is utilized to reduce the time taken to complete the hashing phase of the Bloom filter operations. For the CPU version, Phase-3 is implemented in the C language without any multi-threading paradigms, so that it serves as a serial benchmark against which the parallel GPU version is compared. The pseudocode for Phase-3 and Phase-4 is described in Algorithm 3. The pseudocodes for pattern insert and parallel pattern search are described in Algorithm 4 and Algorithm 5.
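The mapping of one input string to one logical thread can be sketched on the CPU in Python. This is only a simulation under stated assumptions, not the CUDA kernel of Algorithms 3 to 5: the helper names (`build_offsets`, `thread_insert`, `launch`), the sentinel-terminated offsets array, and the CRC32-based placeholder hashing are hypothetical. The sketch shows that the filter contents do not depend on the order in which the threads run, because setting a bit to 1 is idempotent.

```python
import random
import zlib


def build_offsets(buffer):
    """Start offset of every space-separated word, plus a final sentinel."""
    offsets = [0]
    for i, byte in enumerate(buffer):
        if byte == 0x20:  # single space separator
            offsets.append(i + 1)
    offsets.append(len(buffer) + 1)  # sentinel: end of the last word + 1
    return offsets


def positions(word, k, m):
    # Placeholder hashing (assumption): CRC32 seeded with the index i.
    return [zlib.crc32(word, i) % m for i in range(k)]


def thread_insert(tid, buffer, offsets, bits, k):
    """Work of one logical thread: locate word `tid` and set its k bits."""
    word = buffer[offsets[tid]:offsets[tid + 1] - 1]
    for pos in positions(word, k, len(bits)):
        bits[pos] = 1


def launch(buffer, offsets, m, k, order):
    """Run the logical threads in the given order and return the bit array."""
    bits = bytearray(m)  # one byte per bit, chosen for readability
    for tid in order:
        thread_insert(tid, buffer, offsets, bits, k)
    return bits


buffer = b"alpha beta gamma delta"
offsets = build_offsets(buffer)
thread_ids = list(range(len(offsets) - 1))
serial = launch(buffer, offsets, 256, 3, thread_ids)
random.shuffle(thread_ids)
shuffled = launch(buffer, offsets, 256, 3, thread_ids)
print(serial == shuffled)
```

On a real GPU with bit-packed words, concurrent updates to the same word would typically require an atomic OR; the sketch sidesteps this by using one byte per bit.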

Implementation of modified one-hashing algorithm
We have implemented a slightly modified version of the one-hashing algorithm wherein, rather than using consecutive prime numbers as modulo operands, we use the current iteration of the hash function as the modulo operand. Furthermore, we utilize the technique proposed in Kirsch et al. [16], but rather than using two independent hash functions, we split a given hash function into two parts. Essentially, our hash computation is a hybrid of the one-hashing Bloom filter (OHBF) [19] and the less-hashing Bloom filter (LHBF) [16]. The difference between the traditional hashing implementation and the proposed hashing implementation is given in Fig. 4a, b.
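The idea of splitting one hash value into two parts can be sketched in Python as follows. This is only the generic double-hashing construction in the spirit of Kirsch et al. [16], where address i is (h1 + i * h2) mod m; it does not reproduce the modified one-hashing step described above. The function name `split_hash_positions`, the choice of SHA-256, the 128-bit halves, and forcing the stride `h2` to be odd are assumptions made for illustration.

```python
import hashlib


def split_hash_positions(item, k, m):
    """Derive k addresses in [0, m) from a single hash computation."""
    digest = hashlib.sha256(item).digest()  # one hash call per item
    h1 = int.from_bytes(digest[:16], "big")  # first half of the digest
    h2 = int.from_bytes(digest[16:], "big")  # second half of the digest
    # Assumption: force an odd stride so it is never zero; when m is a
    # power of two this also keeps the k addresses distinct.
    h2 |= 1
    return [(h1 + i * h2) % m for i in range(k)]


print(split_hash_positions(b"pattern", k=7, m=1024))
```

The expensive hash function is evaluated once per element, and the remaining k - 1 addresses cost one multiplication, one addition and one modulo operation each, which is what makes it possible to compare hash functions one at a time.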

Performance trade-off in hash functions
Cryptographic hash functions display a good quality of randomness and uniform distribution but are computationally expensive. Popular choices of cryptographic hash functions are SHA1, MD5 and BLAKE3. The cost of these functions usually depends heavily on the size of the input data, which essentially rules them out for applications where long strings need to be hashed to generate addresses for the Bloom filter. Non-cryptographic hash functions are computationally inexpensive and are often used in Bloom filter implementations. They display a relatively low amount of randomness, which often leads to an increase in the false-positive probability. Nevertheless, they remain a popular choice, especially XXHASH32, DJB2, Jenkins, and APHash. We have implemented and recorded the average performance of various cryptographic and non-cryptographic hash functions, as shown in Figs. 5 and 6, to assess their suitability for the proposed scheme. The CPU version is faster than the GPU version when the input contains fewer than roughly 15,000 to 25,000 strings, as shown in Figs. 7 and 8. This is explained by the memory transfer latency in the GPU version, wherein the input data and the index array have to be copied from the host memory to the global memory. When the amount of data is larger, however, the parallel implementation outperforms the serial implementation. Further optimizations are possible which would make the parallel implementation even more efficient. We have also implemented the same scheme in OpenMP, and the results are benchmarked as shown in Fig. 9. The same set of input strings is fed into the GPU implementation with a variety of hash functions, and their performance is benchmarked in Figs. 10, 11, 12, 13 and 14.
Fig. 5 The non-cryptographic hash function performance on CPU
Fig. 6 The cryptographic hash function performance on CPU
As expected, non-cryptographic hash functions outperform cryptographic hash functions in terms of execution speed. This benchmark result and the application domain in which a Bloom filter is deployed are two major factors that must be considered before choosing the appropriate hash function. In both the CPU and GPU benchmarking scenarios, the number of input strings is incremented by 10,000 in every iteration, up to 1 million strings. The results indicate that the query time together with the insertion time provides a good metric for real-world simulation, because the Bloom filter data structure stays in the global memory as it would in a model deployed in a live project.
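To give a sense of why non-cryptographic hash functions are cheap, the following Python sketch shows two of them with 32-bit arithmetic: DJB2 and Jenkins. The text does not state which Jenkins variant was benchmarked, so the one-at-a-time variant used here is an assumption, as are the function names. The sketch is meant to show how few operations each input byte requires; timings of pure Python code are not representative of the C and CUDA results reported in the figures.

```python
MASK32 = 0xFFFFFFFF  # emulate unsigned 32-bit overflow


def djb2(data):
    """DJB2: hash = hash * 33 + byte, starting from 5381."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & MASK32
    return h


def jenkins_one_at_a_time(data):
    """Jenkins one-at-a-time (assumed variant): add, shift and XOR per byte."""
    h = 0
    for byte in data:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    # Final avalanche steps
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


print(hex(djb2(b"pattern")), hex(jenkins_one_at_a_time(b"pattern")))
```

Each input byte costs only a handful of additions, shifts and XORs, whereas a cryptographic hash function runs many rounds of mixing per input block.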

Conclusion and future work
We have implemented the standard Bloom filter on GPU using the CUDA framework. We have also utilized a slightly modified version of the one-hashing algorithm to compute multiple hash functions from a single hash value. This one-hashing algorithm allowed us to benchmark each hash function on its own, without combining several varieties of hash functions, and we have explicitly compared several cryptographic and non-cryptographic hash functions on CPU and GPU. The implementation results show that the GPU implementation of the Bloom filter is more efficient than the CPU version when the input is large enough to amortize the data transfer overhead. Furthermore, it is also observed that non-cryptographic hash functions are significantly faster than cryptographic hash functions in the GPU implementation. Further, domain-specific GPU optimizations such as the use of shared memory and memory coalescing can be adopted in the proposed scheme to improve its performance.
Fig. 7 The Bloom filter insertion performance on CPU v/s GPU
Fig. 8 The Bloom filter Insert+Query performance on CPU v/s GPU
Fig. 9 The Bloom filter insertion performance on CPU v/s CPU-OpenMP
Fig. 10 The hash function performance on GPU for 10^5 words with 10^4 steps
Author Contributions RB contributes to conceptualization, methodology, modeling, implementation, interpretation and validation of results, manuscript writing and review. RKT contributes to implementation, interpretation and validation of results, manuscript review. RPV contributes to implementation, interpretation and validation of results, manuscript review.
Funding Open access funding provided by Manipal Academy of Higher Education, Manipal.
Fig. 11 The hash function performance on GPU for 10^5 words in 20^4 steps
Fig. 12 The hash function performance on GPU for 10^5 words with 40^4 steps
Fig. 13 The hash function performance on GPU for 10^5 words in 60^4 steps
Fig. 14 The hash function performance on GPU for 10^5 words in 80^4 steps
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.