Autoscaling Bloom filter: controlling trade-off between true and false positives

A Bloom filter is a special case of an artificial neural network with two layers. Traditionally, it is seen as a simple data structure supporting membership queries on a set. The standard Bloom filter does not support the delete operation, and therefore, many applications use a counting Bloom filter to enable deletion. This paper proposes a generalization of the counting Bloom filter approach, called “autoscaling Bloom filters”, which allows adjustment of its capacity with probabilistic bounds on false positives and true positives. Thus, by relaxing the requirement on perfect true positive rate, the proposed autoscaling Bloom filter addresses the major difficulty of Bloom filters with respect to their scalability. In essence, the autoscaling Bloom filter is a binarized counting Bloom filter with an adjustable binarization threshold. We present the mathematical analysis of its performance and provide a procedure for minimizing its false positive rate.


Introduction
Many applications require fast and memory-efficient querying of an item's membership in a set. A Bloom filter (BF) is a simple binary data structure that supports approximate set membership queries. From a neural processing point of view, BFs are a special case of an artificial neural network with two layers (input and output), where each position in a filter is a binary neuron. Such a network has no interneuronal connections, i.e., output neurons (positions of a filter) have only individual connections with themselves and the corresponding input neurons. The standard BF (SBF) allows adding new elements to the filter and is characterized by a perfect true positive rate while allowing a nonzero false positive rate. These performance characteristics depend on the filter's parameters, including the number of hash functions and the size of the BF. The SBF, however, lacks the functionality of deleting an element. Therefore, a counting Bloom filter (CBF) [Fan et al(2000)], which provides the delete operation, is commonly used. When the size of the CBF and the number of elements of a set to be stored are known, the number of hash functions can be optimized to minimize the false positive rate. Such optimization is, however, very complicated when the number of stored elements varies dynamically, and it boils down to a computationally expensive recalculation of the content of the filter. To address this issue, we propose an autoscaling Bloom filter (ABF) that significantly reduces the false positive rate solely by optimizing the update rule of the filter. The ABF operates with a fixed amount of resources (k hash functions) over a wide dynamic range of input elements.
The ABF, however, slightly reduces the perfect true positive rate of the CBF. This can be tolerated in many applications, including networking [Donnet et al(2006)] and, more generally, the area of approximate computing, where errors and approximations are becoming acceptable as long as the outcomes have a well-defined statistical behavior [Akhlaghi et al(2016)].
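As background for the CBF's delete operation discussed above, a counting filter can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation; in particular, deriving the k hash functions by salting SHA-256 is our own assumption:

```python
import hashlib

class CountingBloomFilter:
    """Minimal counting Bloom filter sketch (illustrative only)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m  # integer counters instead of bits

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts
        # (an assumed hashing scheme, not the one used in the paper).
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item):
        # Deletion is possible because counters record multiplicity.
        for p in self._positions(item):
            self.counters[p] -= 1

    def query(self, item):
        # Standard CBF membership test: all k counters must be nonzero.
        return all(self.counters[p] > 0 for p in self._positions(item))
```

Binarizing the counters with `[c > 0 for c in self.counters]` yields the corresponding SBF.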
The ABF belongs to the class of binary BFs and is constructed by binarization of the CBF with the binarization threshold as a parameter. Fig. 1 illustrates the main idea behind the ABF. Fig. 1.a exemplifies a CBF of size 20 which stores four elements (x_1 to x_4). Each element is mapped to three different positions of the filter. Values of positions vary between 0 and 4 (highlighted by different colors). The SBF (Fig. 1.b) is formed by setting all nonzero positions of the CBF to one. The two lower parts of the figure present two examples of the ABF with different binarization thresholds Θ (Θ = 1 and Θ = 3, respectively). Element y is queried for set membership. Note that in the SBF all nonzero positions of y are set to one; hence, y is an example of a false positive. In contrast, in Fig. 1.c y has only one position in common with the ABF while all elements x_i have at least two positions; thus, y will be correctly rejected by the ABF. On the other hand, the ABF in Fig. 1.d (Θ = 3) is too sparse and, therefore, erroneously returns negative results even for the elements x_i.
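The effect illustrated in Fig. 1 can be reproduced numerically. The sketch below uses illustrative toy values (not the figure's exact data): it binarizes a small counter array at two thresholds and compares the dot products for a stored element and a non-member:

```python
# Counters of a toy CBF (m = 10); a stored element x hits positions {0, 3, 7},
# a non-member y happens to hit positions {1, 3, 8}. Values are illustrative.
cbf = [2, 1, 0, 3, 0, 1, 0, 2, 1, 0]
x_positions = [0, 3, 7]
y_positions = [1, 3, 8]

def binarize(counters, theta):
    # ABF rule: positions with value > theta become 1, the rest 0.
    return [1 if c > theta else 0 for c in counters]

def dot(abf, positions):
    # Dot product with an element's individual BF = number of its
    # positions that remain set in the binarized filter.
    return sum(abf[p] for p in positions)

# At theta = 0 (the SBF), x and y both score 3 -> y is a false positive.
# At theta = 1, x still scores 3 but y drops to 1 -> y can be rejected.
for theta in (0, 1):
    abf = binarize(cbf, theta)
    print(theta, dot(abf, x_positions), dot(abf, y_positions))
```

With a decision threshold of 2 on the dot product, the filter binarized at Θ = 1 accepts x and rejects y, mirroring the situation in Fig. 1.c.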
We also present an interplay between BFs and hyperdimensional computing [Kanerva(2009)] as another neural information processing approach. Hyperdimensional computing provides a bio-inspired representation of structured knowledge. Its development was stimulated by studies on brain activity which showed that processing of even simple mental events involves simultaneous activity in many dispersed neurons. Information in hyperdimensional computing is represented in a distributed fashion [Browne(1996)]: a single concept is associated with a pattern of activation of many neurons. This is achieved by using vectors with very large dimensions. This paper explores a direct correspondence between BFs and hyperdimensional representations. BFs are treated as a special case of distributed representations where each element stored in the BF is a hyperdimensional binary vector constructed through hash functions. The construction of the filter itself corresponds to the bundling operation [Rachkovskij(2001)] on binary vectors. The mathematics of sparse hyperdimensional computing (SHC) [Rachkovskij(2001)] is used to describe the behavior of the proposed ABF (see the comparison in Table 1). Essentially, the ABF is a generalized form of binary BFs with probabilistic bounds on false positives and true positives. This paper presents the mathematical analysis and experimental evaluation of the ABF's properties. It also suggests a procedure for automatic minimization of the false positive rate, adapting to the number of elements in the filter. To the best of our knowledge, this is a novel variant of BFs, which makes them particularly useful in scenarios where the number of stored elements is unknown or changes dynamically with time.
The paper is structured as follows: Section 2 presents a concise survey of the related approaches. Section 3 describes the ABF and introduces analytical expressions characterizing its performance. The evaluation of the ABF is presented in Section 4. The paper is concluded in Section 5.

Related Work
Detailed surveys on BFs and their applications are provided in [Tarkoma et al(2012)] and [Broder and Mitzenmacher(2004)]; this section overviews the approaches most relevant to the presented ABF. Improving the SBF performance is a popular research topic. The ternary BF [Lim et al(2017)] improves the performance of the CBF by allowing only three possible values in each position. The deletable BF [Rothenberg et al(2010)] uses additional positions in the filter to support the deletion of elements without introducing false negatives. The complement BF [Lim et al(2015)] uses an additional BF in order to identify the trueness of BF positives. The cross-checking BF [Lim et al(2014)] constructs several additional BFs, which are used to cross-check the main BF when it issues a positive result. The on-off BF [Carrea et al(2016)] reduces false positives by including in the filter additional information about the elements that generate them. The retouched BF (RBF) [Donnet et al(2006)] is ideologically the closest approach to the ABF since it allows some false negatives in order to decrease the false positive rate. The major difference from the proposed approach is that the RBF eliminates false positives that are known in advance. In contrast to the previous work, the ABF reduces the false positive rate even when the universe of elements is either unknown or too large to use additional mechanisms for encoding the elements not included in the filter.

Preliminaries
Without loss of generality, suppose that the k results of hash functions applied to an element q from the universe never coincide, i.e., all k indices pointing to positions in the filter are unique. In this case, the BF of a single element (denoted as q) from the universe can be seen as a draw from the hypergeometric distribution with a population of size m that contains exactly k successes (positions set to one). The probability of a success in a particular position is:

$p_1 = k/m$. (1)
Individual BFs representing n unique elements (denoted as x_i) are stored in the CBF as:

$\mathrm{CBF} = \sum_{i=1}^{n} x_i$.

The SBF is related to the CBF as follows: SBF = [CBF > 0], where [·] equals 1 if the condition is true and 0 otherwise. Given the values of m and n, the k that minimizes the false positive rate for the SBF (CBF) can be found as:

$k = (m/n) \ln 2$. (2)

The probability of an empty position p_0 in the SBF (CBF), when the results of hash functions can coincide, is:

$p_0 = (1 - 1/m)^{kn} \approx e^{-kn/m}$. (3)

When performing the set membership query operation with the SBF, an element q (with an individual BF q) might be a member of the SBF if and only if the dot product (denoted as d) between the SBF and q equals the number of nonzero positions in q, i.e., k:

$d(\mathrm{SBF}, q) = \mathrm{SBF} \cdot q = k$.
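The optimal number of hash functions in (2) and the dot-product membership test can be sketched as follows (an illustrative Python sketch; the function names are ours):

```python
import math

def optimal_k(m, n):
    # Number of hash functions minimizing the SBF false positive rate, eq. (2).
    return (m / n) * math.log(2)

# Filter length used in the evaluation section: m = 10,000.
m = 10_000
print(round(optimal_k(m, 50)))    # n = 50    -> 139
print(round(optimal_k(m, 5000)))  # n = 5000  -> 1

def sbf_query(sbf, q_positions):
    # Membership test: the dot product of the SBF with the element's
    # individual BF must equal k, the number of its nonzero positions.
    return sum(sbf[p] for p in q_positions) == len(q_positions)
```

The two printed values (139 and 1) match the range of k reported later for the optimized BF in the evaluation.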
A value in the ith position of the CBF is characterized by a discrete random variable (denoted as I) in the range I ∈ Z, 0 ≤ I ≤ n. It is described by the binomial distribution: I ∼ B(n, p_1). Given the parameters of the distribution, the probability of value v is:

$\Pr(I = v) = \binom{n}{v} p_1^v (1 - p_1)^{n - v}$. (4)

According to (4), the probability of an empty position in the SBF (CBF) is:

$p_0 = \Pr(I = 0) = (1 - p_1)^n = (1 - k/m)^n$. (5)

In fact, (5) differs from the standard expression (3) for p_0. However, the two produce noticeably different results only for small lengths of the filter (m < 50), which are of no practical importance.
Because each position in the CBF can be treated as an independent variable, the expected number of positions l with value v equals:

$l = m \Pr(I = v)$. (6)
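The count distribution (4), the empty-position probability (5) with its standard counterpart (3), and the expected number of positions holding a given value can be checked numerically. A minimal sketch, using the parameters of the worked example later in the paper (m = 10,000, n = 500, k = 100); the helper name is ours:

```python
import math

def count_pmf(v, n, k, m):
    # Probability that a CBF position holds value v: I ~ B(n, p1), p1 = k/m (eq. (4)).
    p1 = k / m
    return math.comb(n, v) * p1**v * (1 - p1)**(n - v)

m, n, k = 10_000, 500, 100

p0_binomial = count_pmf(0, n, k, m)   # eq. (5): (1 - k/m)^n
p0_standard = (1 - 1/m)**(k * n)      # eq. (3): (1 - 1/m)^(kn)

# Expected number of empty positions in the CBF (the l of the text, with v = 0).
expected_zero_positions = m * count_pmf(0, n, k, m)
```

For this filter length the two expressions for p_0 agree to about three decimal places, as the text claims for any m of practical size.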

Definition of Autoscaling Bloom Filter
Given the CBF, the ABF is formed by setting to zero all positions whose values are less than or equal to the chosen threshold Θ; positions whose values are greater than Θ are set to one:

$\mathrm{ABF} = [\mathrm{CBF} > \Theta]$. (7)

Note that when Θ = 0, the ABF is equivalent to the SBF. When the threshold of the ABF is greater than zero, the probability of an empty position in the ABF (denoted as P_0) is higher than in the SBF because a portion of the nonzero positions in the CBF is set to zero. For a given Θ, P_0 is calculated as follows:

$P_0 = \sum_{v=0}^{\Theta} \Pr(I = v)$. (8)

Then the probability of a one in the ABF (denoted as P_1) is:

$P_1 = 1 - P_0$. (9)

In general, the expected dot product (denoted as d̄_x) between the ABF and an element x included in the filter is less than or equal to k. This is due to the suppression of nonzero positions in the CBF: these positions contribute to the dot product in the case of the SBF, while they are set to zero in the ABF. Therefore, a second parameter of the ABF is needed, which determines the lowest value of the dot product indicating the presence of an element in the filter. Denote this parameter as T (0 ≤ T ≤ k); then an element of the universe q might be a member of the ABF if and only if the dot product between the ABF and q is greater than or equal to T.
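The probability of an empty ABF position is the lower tail of the binomial count distribution, and the probability of a one is its complement. A minimal sketch under the binomial model above (the function name is ours), again with the worked example's parameters:

```python
import math

def abf_empty_prob(theta, n, k, m):
    # P0: probability that an ABF position is zero, i.e. that the underlying
    # CBF counter I ~ B(n, p1) is at most theta.
    p1 = k / m
    return sum(math.comb(n, v) * p1**v * (1 - p1)**(n - v)
               for v in range(theta + 1))

# Parameters of the worked example later in the paper.
m, n, k = 10_000, 500, 100
p0 = abf_empty_prob(1, n, k, m)  # P0 for theta = 1
p1_abf = 1 - p0                  # P1: probability of a one in the ABF
```

As expected, P_0 grows with Θ: raising the threshold zeroes out more positions, which is exactly the sparsification visible in Fig. 1.d.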
A position hit by a stored element x holds a count of 1 (from x itself) plus a B(n − 1, p_1) contribution from the other stored elements; it survives binarization exactly when this extra contribution is at least Θ. The expected dot product for an element included in the ABF is therefore calculated as:

$\bar{d}_x(\mathrm{ABF}, x) = k \left(1 - \sum_{v=0}^{\Theta - 1} \binom{n-1}{v} p_1^v (1 - p_1)^{n-1-v}\right)$. (10)

Note that when Θ = 0, d̄_x(ABF, x) = k. In other words, this makes the SBF a special case of the ABF.
The expected dot product (denoted as d̄_y) between the ABF and an element y not included in the filter is determined by the number of nonzero positions in the filter and is calculated as:

$\bar{d}_y(\mathrm{ABF}, y) = k P_1$. (11)

Both dot products d_x and d_y are characterized by discrete random variables (denoted as X and Y, respectively), which in turn are described by binomial distributions: X ∼ B(k, p_x) and Y ∼ B(k, p_y).
The success probabilities (p_x and p_y) of these distributions are determined from the expected values of the dot products in (10) and (11):

$p_x = \bar{d}_x / k$, (12)

$p_y = \bar{d}_y / k = P_1$. (13)
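The two success probabilities can be computed directly from Θ and the filter parameters. A minimal sketch under the binomial model of (10) and (11) (the function name is ours):

```python
import math

def success_probs(theta, n, k, m):
    p1 = k / m
    # p_x: a position hit by a stored element keeps a 1 after binarization
    # when the B(n-1, p1) contribution of the other elements is at least theta.
    p_x = 1 - sum(math.comb(n - 1, v) * p1**v * (1 - p1)**(n - 1 - v)
                  for v in range(theta))
    # p_y = P1: probability that an arbitrary ABF position is one.
    p_y = 1 - sum(math.comb(n, v) * p1**v * (1 - p1)**(n - v)
                  for v in range(theta + 1))
    return p_x, p_y
```

For Θ = 0 this gives p_x = 1 (the SBF case) and, for any Θ, p_x ≥ p_y, which is what makes the membership test informative.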

Performance properties of ABF
Given T, the true positive rate (TPR) of the ABF can be calculated using the probability mass function of X as:

$\mathrm{TPR} = \sum_{j=T}^{k} \binom{k}{j} p_x^j (1 - p_x)^{k - j}$. (14)

Similarly, the false positive rate (FPR) is calculated using the probability mass function of Y as:

$\mathrm{FPR} = \sum_{j=T}^{k} \binom{k}{j} p_y^j (1 - p_y)^{k - j}$. (15)
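Both rates are upper tails of binomial distributions and can be computed exactly. A minimal sketch (the helper names are ours):

```python
import math

def binom_tail(j0, k, p):
    # Pr(B(k, p) >= j0): upper tail of a binomial distribution.
    return sum(math.comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(j0, k + 1))

def rates(T, k, p_x, p_y):
    # TPR and FPR of the ABF for decision threshold T on the dot product.
    return binom_tail(T, k, p_x), binom_tail(T, k, p_y)
```

Raising T trades TPR for FPR: both tails shrink, but the FPR tail (centered at k·p_y < k·p_x) shrinks faster, which is the lever the optimization below exploits.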

Optimization of ABF's parameters
In order to choose the best value of T (or even both Θ and T), an optimization criterion is needed. We propose to optimize the accuracy of the filter (ACC), which is defined as the average of the true positive rate and the true negative rate: ACC = (TPR + (1 − FPR))/2. Note that this definition of accuracy is also referred to as the unweighted average recall. In addition, an application may specify the lowest acceptable TPR (denoted as L_TPR); then the optimal value of T (for a fixed Θ) is found as:

$T^{*} = \arg\max_{T} \mathrm{ACC} \quad \text{s.t. } \mathrm{TPR} \geq L_{TPR}$. (16)

In general, both parameters of the ABF, Θ and T, can be optimized as:

$(\Theta^{*}, T^{*}) = \arg\max_{\Theta, T} \mathrm{ACC} \quad \text{s.t. } \mathrm{TPR} \geq L_{TPR}$. (17)
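Since Θ and T range over small discrete sets, the search in (16) and (17) can be an exhaustive sweep. The sketch below is an illustrative implementation under the binomial model described above; the helper names, the search bound `max_theta`, and the early break (TPR only decreases as T grows) are our own choices:

```python
import math

def binom_tail(j0, k, p):
    # Pr(B(k, p) >= j0).
    return sum(math.comb(k, j) * p**j * (1 - p)**(k - j) for j in range(j0, k + 1))

def success_probs(theta, n, k, m):
    p1 = k / m
    p_x = 1 - sum(math.comb(n - 1, v) * p1**v * (1 - p1)**(n - 1 - v)
                  for v in range(theta))
    p_y = 1 - sum(math.comb(n, v) * p1**v * (1 - p1)**(n - v)
                  for v in range(theta + 1))
    return p_x, p_y

def optimize(n, k, m, l_tpr, max_theta=10):
    # Exhaustive sweep over (theta, T) maximizing ACC subject to TPR >= l_tpr.
    best = None
    for theta in range(max_theta + 1):
        p_x, p_y = success_probs(theta, n, k, m)
        for T in range(k + 1):
            tpr = binom_tail(T, k, p_x)
            if tpr < l_tpr:
                break  # TPR only decreases as T grows; no point continuing
            fpr = binom_tail(T, k, p_y)
            acc = (tpr + (1 - fpr)) / 2
            if best is None or acc > best[0]:
                best = (acc, theta, T, tpr, fpr)
    return best  # (ACC, theta, T, TPR, FPR)
```

Running this with the parameters of the worked example (m = 10,000, n = 500, k = 100, L_TPR = 0.97, Θ ≤ 5) selects Θ = 4, consistent with the numbers reported for Fig. 2.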

Evaluation of ABF

An example: ABF in action
The behavior of the ABF for different Θ is illustrated in Fig. 2. The length of the CBF (and of all derived ABFs) is m = 10,000. It stores n = 500 unique elements; each element is mapped to an individual BF with k = 100 nonzero positions. Note that the value of k is intentionally not optimized for the given m and n. Six ABFs are formed from the CBF using different thresholds in the range 0 ≤ Θ ≤ 5. Each plot in Fig. 2 corresponds to one ABF and depicts the probability mass functions of X (circle markers) and Y (diamond markers). The plot for Θ = 0 corresponds to the SBF. In this case, X is deterministic and located at k = 100, as expected for the SBF; hence, the optimal value of T is trivial: it also equals k, and TPR = 1. A large portion of the distribution of Y is also concentrated at k = 100, which leads to a high FPR = 0.52. On the other hand, the ABFs with Θ > 0 result in a better separation of the two distributions. A much lower FPR can be achieved by compromising the 100% TPR. The optimal values of T (indicated by black vertical bars) were found for each value of Θ according to (16). The lowest acceptable value of TPR, L_TPR, was set to 0.97. The best values of TPR, FPR, and ACC for each plot are shown in the figure. For example, even changing Θ from 0 to 1 allows decreasing FPR from 0.52 to 0.24 by losing only 3% of TPR. Overall, the accuracy is improved by 0.13. The best performance among the considered range is achieved for Θ = 4, resulting in TPR = 0.98, FPR = 0.04, ACC = 0.97, thus improving the accuracy of the SBF by 31%.
Fig. 3 compares the performance of three different BFs; each panel corresponds to a performance metric: left, TPR; center, FPR; right, ACC. The performance was studied for a varying number of unique elements stored in the filter (50 ≤ n ≤ 5,000, step 50). The length of the filters was the same as in Fig. 2, m = 10,000.
For the optimized BF, k was calculated as in (2) for each value of n and varied between 1 and 139. For the autoscaling and nonoptimized BFs, k was fixed at 100. The ABF was formed from the CBF. Only the two parameters (Θ and T) of the ABF were optimized for each value of n according to (17) with L_TPR = 0.9. Note that these two parameters do not change the amount of hardware resources for an ABF implementation since k is fixed, while an optimized BF implementation might require up to 1.4× more hash functions. This overhead directly translates to a larger silicon area or a slower speed for the hardware implementation of the optimized BF compared to the ABF. The TPR of the optimized and nonoptimized BFs is always 1, while the TPR of the ABF varies in the allowed range between L_TPR and 1. For large values of n (>1000), the TPR of the ABF is consistently chosen to be close to L_TPR. The FPR of all filters increases with the growth of n. As anticipated, the nonoptimized BF soon (at n ≈ 1000) reaches FPR = 1 and stays there until the end. In contrast, both the ABF and the optimized BF demonstrate a smooth increase in FPR. It is lower than 1 for both filters even at n = 5,000 (approximately 0.6 and 0.4, respectively). The accuracy curves aggregate the behavior of TPR and FPR. For most values of n, the nonoptimized BF features ACC = 0.5 because its FPR = 1. The accuracies of the ABF and the optimized BF smoothly decay with the growth of n, being 0.66 and 0.8 at n = 5,000. Thus, the ABF significantly outperforms the nonoptimized BF once the latter's FPR starts to increase. In general, the performance of the ABF follows that of the optimized BF with some constant loss. The best trade-off between TPR and FPR is in the region of n where the FPR of the nonoptimized BF is steeply increasing from 0 to 1.
An important advantage of the ABF over the optimized BF is that it does not require the recalculation of the whole filter as the number of stored elements increases, whereas the optimized BF must be recalculated whenever another value of k is chosen.

Conclusion
This paper introduced the autoscaling Bloom filter. The ABF is a generalization of the standard binary BF with procedures for achieving probabilistic bounds on false positives and true positives. It was shown that the ABF can significantly decrease the false positive rate at the cost of allowing some nonzero false negative rate. The evaluation revealed that the accuracy of the ABF follows that of the standard BF with an optimized number of hash functions, with some constant loss. As opposed to the optimized BF, the ABF provides means for optimizing the filter's performance when the number of elements stored in the filter changes dynamically, while the number of hash functions is fixed.