1 Introduction

Membership testing is the problem of determining whether a queried element \(e_\text {q}\) exists in a given element set S, which arises throughout computer-related fields, e.g., database systems, web search, firewalls, and network routing. The proliferation of massive raw data brings with it a series of storage and processing challenges.

To enable fast membership testing over massive raw data, effective data structures are needed that store raw data in small space and answer queries efficiently. As a common data structure, the hash table works well for membership testing: data are mapped into a table using random hash functions. Although this ensures fast queries to some extent, hash tables have long been criticized for their low space efficiency. To achieve both high space efficiency and high-speed query processing, the Bloom filter (BF) [1], a probabilistic data structure, was proposed to perform approximate membership testing. However, using a Bloom filter always requires a trade-off between the false positive rate (FPR) and memory usage. In recent years, extensive effort has therefore been devoted to lowering both the FPR and the memory occupancy of Bloom filters, but most existing studies [2,3,4,5,6] have not achieved significant improvement.
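To make the trade-off concrete, the following is a minimal Python sketch of a standard Bloom filter (an illustration, not a production implementation): an m-bit array shared by k hash functions, where the k functions are simulated by salting a single base hash.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting one base hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: str) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))
```

A larger m lowers the FPR but raises memory usage, which is exactly the trade-off discussed above.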

To further improve space efficiency, Bloom filters empowered by machine learning techniques have been proposed [7,8,9]. Kraska et al. [7] regard membership testing as a binary classification problem and combine a learned classification model with a traditional Bloom filter; such a data structure is called a Learned Bloom filter (LBF). Based on LBF, Dai et al. [8] propose the Ada-Bloom filter, which divides the scores produced in the machine learning stage into different intervals and applies different discrimination strategies to elements in different score intervals. When the amount of data in a data stream approaches infinity, the FPR of a BF/LBF may approach 1. To adapt LBF to extremely large amounts of data, Liu et al. [9] propose the Stable LBF, which combines the classifier with an updatable stable Bloom filter, making the filter's performance decay more controllable and achieving a satisfactory FPR.

Previous studies primarily focus on single-key membership testing with the learned Bloom filter; that is, their methods are designed for scenarios where each element has only one key-value pair. In a multi-key scenario, e.g., storing data in a distributed database across multiple physical nodes, accessing the disks of all physical nodes whenever a query with filtering conditions arrives increases latency. By implementing an efficient and reliable multi-key Bloom filter at each physical node, we can significantly reduce the fan-out of nodes. Additionally, the traditional Bloom filter can only handle queries in which all keys are specified. Existing methods often concatenate all keys into one key, or input each key sequentially, to reduce multi-key scenarios to single-key scenarios. However, when processing a query in which not all keys are specified, the traditional Bloom filter usually suffers a high FPR, because each individual key-value pair may exist while their combination does not. Moreover, the traditional Bloom filter applies the same family of hash functions to each key in the query, which further raises the FPR.

To achieve a lower FPR and lower space consumption in multi-key membership testing, in this paper we propose a method called the Multi-key Learned Bloom Filter (MLBF). Similar to other LBF methods, MLBF first classifies a queried element with a value-interaction-based multi-key classifier; elements the classifier judges to be absent then enter a multi-key Bloom filter for further determination. We also propose an optimization strategy for the multi-key Bloom filter to improve the performance of MLBF.

Our contributions can be summarized as follows:

(1) To the best of our knowledge, this is the first work that adopts LBF to systematically solve the multi-key membership testing problem.

(2) For multi-key membership testing, we propose a Value-Interaction-based Multi-key Classifier (VIMC) model, which does not rely on feature engineering, to learn value interactions between keys and classify multi-key elements. In addition, we propose an adaptive negative weighted cross-entropy loss function that limits the FPR of the LBF's learning model, thereby reducing the FPR of the entire LBF.

(3) We also propose an optimization strategy for the multi-key Bloom filter, i.e., interval-based optimization, and utilize an out-of-distribution (OOD) detection mechanism, making our proposed method more applicable.

(4) We report on experiments using real data, offering evidence of the good performance and applicability of the paper's proposals.

The rest of the paper is structured as follows: We give the problem definition and background of LBF in Sect. 2. Section 3 details the proposed data structures and analyzes their feasibility and performance. We report the results of an empirical study in Sect. 4, survey related work in Sect. 5, and conclude the paper in Sect. 6.

2 Preliminaries

In this section, we first formalize the problem of multi-key membership testing and then introduce the multi-key Bloom filter that we exploit in the proposed data structures.

2.1 Problem Statement

2.1.1 Multi-key Element Set

A multi-key element set (a.k.a. a multi-key member set), S, is a set of elements \(S = \{e_1, e_2,...,e_n\},\) where n is the number of elements, and \(e_i\in S\) denotes a multi-key element containing c key-value pairs, i.e., \(e_i = \{key_1: v_{1}, key_2: v_{2},...,key_c: v_{c}\}\).

2.1.2 Multi-key Membership Testing

Given a multi-key element set \(S=\{e_1,e_2,..., e_n\}\) and a queried element \(e_q=\{key_{q1}: v_{q1}, key_{q2}: v_{q2},...,key_{qc}: v_{qc}\}\), where \(1\le q1<q2<...<qc\le c\), i.e., the query may specify only a subset of the c keys. As shown in Fig. 1, if there is a multi-key element \(e_i\) in the multi-key set S such that \(v_{qi}=v_i\) for every specified key \(key_{qi}=key_i\), element \(e_\text {q}\) exists. That is, query \(e_\text {q}\) exists in S if the values of its specified keys all match those of some element in S.
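To illustrate these semantics, the following Python sketch (with hypothetical example data) checks a partial-key query against a set; note that the query in the last line fails even though each of its individual key-value pairs occurs somewhere in S, which is exactly the combination problem raised in the introduction.

```python
def multi_key_exists(S, e_q):
    """e_q exists in S if some element of S agrees with e_q on
    every key that e_q specifies (unspecified keys are ignored)."""
    return any(all(e.get(k) == v for k, v in e_q.items()) for e in S)

S = [{"key1": "aa", "key2": "ab", "key3": "ac"},
     {"key1": "ba", "key2": "bb", "key3": "bc"}]
assert multi_key_exists(S, {"key1": "aa", "key3": "ac"})      # partial-key query: matches e_1
assert not multi_key_exists(S, {"key1": "aa", "key2": "bb"})  # each pair exists, combination does not
```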

Fig. 1: Illustration of the multi-key membership testing

2.2 Multi-key Bloom Filter

A multi-key Bloom filter consists of c hash function families and a bitmap of size m. When inserting a multi-key element e, e is fed into these hash function families to compute a bit vector, and the mapped positions in the bitmap are set to 1. When querying whether a multi-key element exists, the element is likewise fed into the hash function families to obtain a bit vector, and each bit of the vector is mapped onto the bitmap. If all the mapped bits in the bitmap are 1, the queried element is reported to exist; otherwise it does not exist.

Consider a multi-key element set containing four elements, \(e_1, e_2, e_3, e_4\). Initially, all bits of the bitmap are set to 0. Taking the insertion of the multi-key element \(e_1 = \{ key_1: ``aa'', key_2: ``ab'', key_3: ``ac'' \}\) as an example, \(e_1\) is first hashed to multiple bit positions. When a query \(e_\text {q}\) arrives, the same hashing strategy used for insertion is applied; if all the hashed positions are 1, the query \(e_\text {q}\) is determined to exist, and otherwise it does not exist. The basic operations of the multi-key Bloom filter and the single-key Bloom filter are thus very similar. However, we need to consider some additional issues, such as how to hash multiple keys into the bitmap and how to reduce the FPR, which we introduce in detail in Sect. 3.

3 Methodology

3.1 MLBF Framework Overview

The framework has two components: a value-interaction-based multi-key classifier and a multi-key Bloom filter. In the first component, a queried multi-key element is input into the value-interaction-based multi-key classifier, which is trained on a given multi-key element set. The classifier produces a score for the element, and the queried element is reported to exist in the given multi-key element set if the score is greater than a given threshold. When the score is less than the threshold, the queried element is passed to a multi-key Bloom filter for further judgment.
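This two-stage decision can be sketched as follows, assuming a classifier object exposing a predict method and a backup filter exposing a query method (both names are illustrative, not the paper's API):

```python
def lbf_query(e_q, classifier, threshold, backup_filter):
    """Two-stage LBF decision: learned classifier first, backup Bloom filter second."""
    score = classifier.predict(e_q)     # learned membership score in [0, 1]
    if score >= threshold:
        return True                     # accepted by the model (may be a false positive)
    return backup_filter.query(e_q)     # rejected by the model: double-check in the filter
```

Because elements accepted by the classifier are never re-checked, the classifier's false positives pass through unchecked; this motivates the FPR-oriented loss in Sect. 3.2.5.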

Fig. 2: VIMC model overview

3.2 Value-Interaction-Based Multi-key Classifier

To solve the multi-key membership testing problem effectively, it is critical to learn sophisticated key interactions, which capture the latent correlations between keys. Therefore, we propose a Value-Interaction-based Multi-key Classifier (VIMC) model that learns value interactions directly from the values of keys. The framework of VIMC is shown in Fig. 2. Specifically, VIMC encodes the values of the inputs with a DeepFM encoder and then learns the value interactions from the encoded data through a Multi-layer Multi-head Self-Attention (MMSA) module, whose outputs are fed into a linear layer.

3.2.1 DeepFM Encoder

DeepFM [10] is an efficient end-to-end learning model that can learn both low-order and high-order feature interactions.

Given a multi-key element set, each element has one or more values for each key. Firstly, the continuous values in the set are discretized, i.e., for \(key_{i}\), its values are divided into \(u_i\) different intervals according to a given segmentation standard \(st_i\). Secondly, we create a dictionary for each key \(key_j\) with discrete value types, where the length of its dictionary is \(dl_j\). Then, an element is embedded into a binary vector \(X = \{x_1,x_2,...,x_L\}\) (a concatenation of per-key one-hot encodings), where \(L = \sum _i u_i+\sum _j dl_j\). When an element contains \(\text {value}_i\), the corresponding \(x_i\) is set to 1; otherwise it is set to 0.
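A possible encoding routine is sketched below: continuous keys are binned into intervals and discrete keys are indexed through per-key dictionaries, yielding a length-L binary vector. The function name, argument layout, and slot ordering are assumptions made for illustration.

```python
import numpy as np

def encode_element(e, bins, dicts):
    """Encode a multi-key element as a length-L binary vector.
    bins:  {key: sorted interval edges} for continuous keys (u_i = len(edges) + 1 slots).
    dicts: {key: {value: local index}} for discrete keys (dl_j slots)."""
    offsets, L = {}, 0
    for k, edges in bins.items():          # interval slots for continuous keys
        offsets[k] = L
        L += len(edges) + 1
    for k, d in dicts.items():             # dictionary slots for discrete keys
        offsets[k] = L
        L += len(d)
    x = np.zeros(L, dtype=np.float32)
    for k, v in e.items():
        if k in bins:
            x[offsets[k] + int(np.searchsorted(bins[k], v))] = 1.0
        elif k in dicts and v in dicts[k]:
            x[offsets[k] + dicts[k][v]] = 1.0
    return x
```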

To capture both low-order and high-order value interactions, DeepFM adopts two components, an FM component and a deep component. The FM component is a factorization machine, which learns order-1 and order-2 value interactions. Specifically, a linear regression model is introduced to learn the order-1 value interactions, as shown in Eq. 1.

$$\begin{aligned} y_\text {order1} = \langle w,x \rangle + b \end{aligned}$$
(1)

where w is the parameter vector of the linear regression model and b is the bias.

To learn the order-2 value interactions, a latent vector \(V_i\) is introduced for \(\text {value}_i\) to measure the impact of its interactions with other values. Specifically, for each value pair (\(\text {value}_i\), \(\text {value}_j\)) in an element, the value interaction between \(\text {value}_i\) and \(\text {value}_j\) is calculated as the inner product of the latent vectors \(V_i\) and \(V_j\). The total order-2 value interaction is calculated as follows:

$$\begin{aligned} y_\text {order2} = \sum _{i=1}^L \sum _{j=i+1}^L \langle V_i,V_j \rangle x_i \cdot x_j \end{aligned}$$
(2)

The output of FM is the summation of the order-1 and order-2 value interaction, which is shown in the following equation:

$$\begin{aligned} y_\text {FM} = y_\text {order1}+y_\text {order2} \end{aligned}$$
(3)

The deep component is a feed-forward neural network that learns high-order feature interactions. The latent vector matrix, which contains the interaction information, is used to encode the values in the deep component. Put differently, each value in an element is mapped to its corresponding latent vector, and the concatenation of these latent vectors is used as the input of the feed-forward network. Taking \(FN\) as the feed-forward network, the output of the deep component for a given query is as follows:

$$\begin{aligned} y_\text {Deep} = {\text {FN}}(v) \end{aligned}$$
(4)

where \(v = [V_1,V_2,...,V_m]\), and m is the number of values in the query.
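The FM component of Eqs. 1-3 can be sketched in PyTorch as follows; the order-2 term uses the standard \(O(L\cdot d)\) reformulation of Eq. 2 instead of the naive double loop, which is an implementation choice rather than something prescribed above.

```python
import torch

def fm_forward(x, w, b, V):
    """FM component (Eqs. 1-3).
    x: (L,) binary input vector; w: (L,) linear weights; b: scalar bias;
    V: (L, d) latent vectors, one per value slot."""
    y_order1 = torch.dot(w, x) + b                 # Eq. 1: linear (order-1) term
    # Eq. 2 via the identity sum_{i<j} <V_i,V_j> x_i x_j
    #   = 0.5 * (||sum_i x_i V_i||^2 - sum_i ||x_i V_i||^2)
    xv = V * x.unsqueeze(1)                        # (L, d); rows with x_i = 0 vanish
    y_order2 = 0.5 * ((xv.sum(0) ** 2).sum() - (xv ** 2).sum())
    return y_order1 + y_order2                     # Eq. 3: y_FM
```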

For DeepFM training, in contrast to conventional classification tasks, we employ a supervised contrastive learning loss for pretraining. Consequently, we feed the output of the two components into a projection network \(Proj\ ( \cdot )\), which maps the high-dimensional features onto a projection space, and we normalize the network output to lie on the unit hypersphere. These normalized projections are then used for supervised contrastive learning.

After training, the latent vectors in the FM component of DeepFM contain order-1, order-2 and high-order value interaction information. Hence, we discard the projection network and use the latent vectors as the embeddings of the corresponding values.

3.2.2 Supervised Contrastive Learning

To fully leverage the information provided by the samples, enhance sample efficiency, and improve the generalization performance of the framework, we employ supervised contrastive learning to train the DeepFM module.

The core principle of contrastive learning is to minimize the distance between an anchor point and its positive samples while maximizing the distance between the anchor point and its negative samples. The efficacy of self-supervised contrastive learning depends heavily on how positive and negative samples are distinguished, whereas supervised contrastive learning incorporates label information to better characterize the similarity of intra-class samples. For Bloom filters, the significance of each sample's keys makes it challenging to augment data samples with noise, as doing so can lead to unpredictable outcomes. Therefore, we adopt the supervised contrastive loss known as SupCon loss [11]. Specifically, as depicted in Fig. 2, we compute the loss over the feature vectors of each batch of samples output by the projection network as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}^\text {SCL}&=\sum _{i=1}^{N}\mathcal {L}_{i}^\text {SCL}\\ \mathcal {L}_{i}^\text {SCL}&=\frac{-1}{N-1} \sum _{j=1}^{N} l_{[ i\ne j]} \cdot l_{[ y_{i} =y_{j}]} \cdot \log \frac{\exp ( z_{i} \cdot z_{j} /\tau )}{\sum _{k=1}^{N} l_{[ k\ne i]} \cdot \exp ( z_{i} \cdot z_{k} /\tau )} \end{aligned} \end{aligned}$$
(5)

where \(z_{i} =Proj( y_\text {Deep} +y_\text {FM})\) is the normalized projection of the i-th sample, \(l_{[\cdot ]}\) is an indicator function that equals 1 when its bracketed condition holds and 0 otherwise, and \(\tau\) is a temperature hyperparameter that is optimized during training.
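A batch-wise sketch of this loss in PyTorch, assuming the projections are L2-normalized and the diagonal (k = i) is masked out of the denominator:

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, y, tau=0.1):
    """SupCon-style loss over a batch (sketch of Eq. 5).
    z: (N, d) projection outputs; y: (N,) integer class labels."""
    N = z.size(0)
    z = F.normalize(z, dim=1)                        # place projections on the unit hypersphere
    sim = z @ z.T / tau                              # pairwise z_i . z_k / tau
    eye = torch.eye(N, dtype=torch.bool)
    logits = sim.masked_fill(eye, float('-inf'))     # exclude k = i from the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~eye  # pairs with y_i = y_j and j != i
    loss_i = -log_prob.masked_fill(~pos, 0.0).sum(1) / (N - 1)
    return loss_i.sum()
```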

After pretraining the DeepFM module using supervised contrastive learning, we discard the projection network and freeze the parameters that have already been trained in the DeepFM module during the subsequent classifier training.

3.2.3 Multi-layer Multi-head Self-Attention

Self-attention [12] uses an attention mechanism to calculate the value interaction between each value in the query and the other values, and uses attention scores to express the degree of interaction between values.

As shown in Fig. 2, each value in a query is encoded by the DeepFM encoder as a vector containing value interaction information. For simple self-attention, the representation matrix of a query will be mapped into \(Q (\text {Query})\), \(K (\text {Key})\) and \(V (\text {Value})\) matrices by three different mapping matrices \(W_\text {Q}\), \(W_\text {K}\) and \(W_\text {V}\). Similar to DeepFM, self-attention measures the correlation between values by calculating the inner product of Q and K, which can well retain the value interaction learned by the DeepFM encoder. We compute the matrix of outputs as:

$$\begin{aligned} \text {Attention}(Q, K, V) = {\text {softmax}}\left( \frac{Q K^T}{\sqrt{d_K}}\right) V \end{aligned}$$
(6)

where \(d_K\) is the dimension of each row vector in K; the scaling by \(\sqrt{d_K}\) prevents the inner products from becoming too large.

Multi-head attention is a combination of multiple self-attention structures, each head learning features in different representation spaces. A multi-head self-attention layer consists of a multi-head self-attention model and a feed-forward network, and the output of multi-head self-attention will pass through the feed-forward network to extract features.
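For reference, a single attention head following Eq. 6 can be sketched as follows; multi-head attention runs several such heads with separate mapping matrices and concatenates their outputs before the feed-forward network.

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention (Eq. 6).
    X: (n, d_model) value embeddings from the DeepFM encoder;
    W_q, W_k, W_v: (d_model, d_k) mapping matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / K.size(-1) ** 0.5      # pairwise interaction strengths
    return F.softmax(scores, dim=-1) @ V      # attention-weighted combination
```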

3.2.4 Progressive Self-Knowledge Distillation

To increase the generalization performance of VIMC, we use a simple yet effective regularization method, namely Progressive Self-Knowledge Distillation (PSKD) [13]. PSKD enables the student model to distill its own knowledge; that is, the student becomes its own teacher. Specifically, PSKD utilizes the past predictions of the student model as a teacher to obtain more information during training. The details of this process are shown in the right part of Fig. 2. Suppose that \(P_t^{stu}(x)\) is the prediction of the student model for input x at the t-th epoch. Then, the objective at the t-th epoch can be written as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{K D, t}(x, y)&=H\left( \left( 1-\alpha _{t}\right) y+\alpha _{t} P_{t-1}^{s t u}(x), P_{t}^{s t u}(x)\right) \\ H(y, p)&=\frac{1}{N} \sum _{i}-\left[ y_{i} \cdot \log \left( p_{i}\right) +\left( 1-y_{i}\right) \cdot \log \left( 1-p_{i}\right) \right] \end{aligned} \end{aligned}$$
(7)

where H is the binary cross-entropy loss function, \(\alpha _t\) is a hyperparameter that determines how much knowledge from the teacher (i.e., \(P_{t-1}^{stu}(x)\)) is accepted, y is the hard target, and \((1-\alpha _t)y+\alpha _t P_{t-1}^{stu}(x)\) is the softened target for input x in the t-th epoch of training.

However, there is a problem with \(\alpha _t\): the model does not yet have enough knowledge in the early stages of training, so \(\alpha _t\) should be kept small at first. For this reason, PSKD increases the value of \(\alpha _t\) as t grows. Specifically, \(\alpha _t\) is computed as follows:

$$\begin{aligned} \alpha _t = \alpha _\text {T} \times \frac{t}{T}, \end{aligned}$$
(8)

where T is the total number of epochs of model training, and \(\alpha _\text {T}\) is the value of \(\alpha _t\) at the last epoch, determined via a validation process.
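Putting Eqs. 7 and 8 together, a PSKD training step can be sketched as below, where p_prev holds the predictions saved from the previous epoch; the function signature is illustrative.

```python
import torch.nn.functional as F

def pskd_loss(p_t, p_prev, y, t, T, alpha_T):
    """PSKD objective (Eqs. 7-8): soften hard targets with last epoch's predictions.
    p_t: current predictions in [0, 1]; p_prev: epoch (t-1) predictions; y: hard labels."""
    alpha_t = alpha_T * t / T                        # Eq. 8: trust the "teacher" more over time
    soft_target = (1 - alpha_t) * y + alpha_t * p_prev.detach()
    return F.binary_cross_entropy(p_t, soft_target)  # Eq. 7: H(soft target, prediction)
```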

3.2.5 Adaptive Negative Weighted Cross-Entropy Loss Function

As with traditional binary classification tasks, we initially considered the balanced cross-entropy loss function. Suppose there are N input samples to be predicted. The balanced loss is defined as follows:

$$\begin{aligned} \text {Loss}_\text {balanced} = \frac{1}{N} \sum _i - \left[ \alpha \cdot y_i \cdot \log (p_i) + (1-\alpha ) \cdot (1-y_i) \cdot \log (1-p_i)\right] \end{aligned}$$
(9)

where \(y_i\) is the label of the input, \(p_i\) is the probability of being predicted to be positive, \(\alpha\) is a weight parameter. Suppose \(C_\text {p}\) and \(C_\text {n}\) are the number of positive and negative examples, respectively. So, the value of \(\alpha\) can be calculated from \(\frac{\alpha }{1-\alpha } = \frac{C_\text {n}}{C_\text {p}}\).

Focal loss [14] allows the binary classification model to focus on learning samples that are difficult to learn and solves the problem of class imbalance to a certain extent. Focal loss is shown as follows:

$$\begin{aligned} \text {Loss}_\text {focal} = \frac{1}{N} \sum _i - \left[ (1-p_i)^\gamma \cdot y_i \cdot \log (p_i) + (p_i)^\gamma \cdot (1-y_i) \cdot \log (1-p_i)\right] \end{aligned}$$
(10)

where \(\gamma\) is a hyperparameter and \(p_i\) reflects how close the prediction is to the ground truth \(y_i\). Focal loss increases the weight of hard samples in the loss function through the modulation of \(\gamma\) and \(p_i\), so that training focuses on the hard samples, which helps to improve accuracy on them.

However, neither the balanced loss nor the focal loss is suitable for the MLBF data structure. As discussed above, for a standard LBF, a query judged by the learned model as a positive example will not be checked again, which leads to false positive examples. Therefore, for the learned model in LBF, false positives are often more unacceptable than false negatives. At the same time, in many scenarios, negative examples that can be used for training are more difficult to obtain than positive examples, which leads to more inaccurate predictions for negative examples.

In order to solve the above problem, we propose an adaptive negative weighted cross-entropy loss function. Specifically, we give each negative example an adaptive weight, which is shown as follows:

$$\begin{aligned} \text {Loss}_\text {ada} = \frac{1}{N} \sum _i - \left[ y_i \cdot \log (p_i) + \exp (\gamma \cdot p_i)\cdot (1-y_i) \cdot \log (1-p_i)\right] , \end{aligned}$$
(11)

where \(y_i\) is the label of the input element \(e_i\), \(p_i\) is the probability that \(e_i\) is predicted to be positive, \(\gamma\) is a hyperparameter, and \(\gamma\) together with \(p_i\) determines the weight of the negative element \(e_i\). Suppose that the label \(y_i\) of the input element \(e_i\) is 0. The larger the predicted value \(p_i\), the greater the weight (i.e., \(\exp (\gamma \cdot p_i)\)) added to the loss of this negative example, which makes the predictions for negatives more accurate and greatly reduces the FPR.
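A direct sketch of Eq. 11 in PyTorch; the clamping constant eps is an implementation detail added here for numerical stability, not part of the formulation above.

```python
import torch

def adaptive_neg_weighted_loss(p, y, gamma=2.0, eps=1e-7):
    """Adaptive negative weighted cross-entropy (Eq. 11).
    Negatives predicted with high p (near false positives) get weight exp(gamma * p)."""
    p = p.clamp(eps, 1 - eps)
    pos_term = y * torch.log(p)
    neg_term = torch.exp(gamma * p) * (1 - y) * torch.log(1 - p)
    return -(pos_term + neg_term).mean()
```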

3.2.6 Out of Distribution Detection

Learned Bloom filter can iteratively improve its performance through the learning process. Nevertheless, in practical scenarios with a high volume of queries, the system may encounter out-of-distribution (OOD) queries that do not conform to the original data distribution. These queries may lead to false positives, potentially compromising the Bloom filter’s reliability and performance. Therefore, establishing mechanisms for detecting and handling OOD queries is essential to ensure the accuracy and robustness of the Bloom filter.

We adopt a loss-based OOD detection mechanism [15] to identify whether an OOD situation exists: we build sampling distributions for new and old data from their average loss values and conduct a statistical test on these distributions. Specifically, we maintain two sets, \(S_\text {old}\) and \(S_\text {new}\), containing \(N\) bootstrapped samples of old and new data, respectively, where a bootstrapped sample is a batch of resampled elements. Periodically, we use the latest converged model to calculate the average loss value of each bootstrapped sample in both sets, which yields a sampling distribution for each set; when detection is necessary, we calculate the standard deviation \(std\) of the sampling distribution of \(S_\text {new}\). We then use the absolute difference between the average loss values of the bootstrapped samples of new and old data as the test statistic:

$$\begin{aligned} d( S_\text {old} ,S_\text {new}) \ =\ \left| \frac{1}{| S_\text {old}| }\sum _{s\in S_\text {old}} L( s;\theta ) -\frac{1}{| S_\text {new}| }\sum _{s\in S_\text {new}} L( s;\theta )\right| \end{aligned}$$
(12)

where \(L\) is the loss function used to train the model with parameters \(\theta\). Finally, we compare the test statistic with the threshold: if \(d( S_\text {old},S_\text {new}) > 2 \times std\), which corresponds to a p-value \(<\delta\) with \(\delta = 0.05\) in a two-sample test procedure, we consider that the data have undergone a significant change, issue an OOD signal, and update the model accordingly.
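A simplified sketch of this test, assuming per-sample losses for the old and new data have already been computed; the bootstrap parameters below are illustrative.

```python
import numpy as np

def ood_signal(losses_old, losses_new, num_boot=100, batch=256, seed=0):
    """Loss-based OOD test (sketch of Eq. 12): compare mean losses of
    bootstrapped batches of old vs. new data; flag OOD when d > 2 * std."""
    rng = np.random.default_rng(seed)

    def boot_means(losses):
        # Sampling distribution of the mean loss over resampled batches.
        return np.array([rng.choice(losses, size=batch, replace=True).mean()
                         for _ in range(num_boot)])

    mu_old, mu_new = boot_means(losses_old), boot_means(losses_new)
    d = abs(mu_old.mean() - mu_new.mean())   # test statistic d(S_old, S_new)
    std = mu_new.std()                       # std of the new-data sampling distribution
    return d > 2 * std
```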

3.3 Multi-key Bloom Filter

In this section, we present the details of the multi-key Bloom filter part of the model. The Multi-key Bloom Filter (MBF) contains a bitmap of m bits. When the processed elements contain c keys, MBF creates c hash function families, and all hash functions are independent of each other. The most basic operations of MBF are Insert(S, e) and Query(S, \(e_\text {q}\)), where S is the element member set.

Insert(S, e): For each element \(e = \{k_{1}, k_{2},..., k_{c}\}\) to be inserted, each key is associated with its own family of hash functions. For the i-th key value in e, we use the i-th hash function family to hash it to the positions \(MBF[h_{i1}(e[i])]\), \(MBF[h_{i2}(e[i])]\),..., \(MBF[h_{ik}(e[i])]\) in the bitmap, and set the corresponding positions to 1.

Query(S, \(e_\text {q}\)): The Query operation is similar to the Insert operation. It also calculates the positions \(h_{i1}(e_i), h_{i2}(e_i),..., h_{ik}(e_i)\) of \(e_\text {q}\) after hashing by the corresponding hash function families. Then, MBF checks whether the corresponding bits \(MBF[h_{i1}(e_i)], MBF[h_{i2}(e_i)],..., MBF[h_{ik}(e_i)]\) are all 1 in the bitmap. If they are, \(e_\text {q}\) is reported to exist; otherwise it is reported not to exist.
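The two operations can be sketched as follows; salting one base hash with the key index simulates independent per-key hash-function families, which is an implementation convenience rather than the paper's exact construction.

```python
import hashlib

class MultiKeyBloomFilter:
    """MBF sketch: one hash-function family per key, one shared m-bit bitmap."""

    def __init__(self, m: int, keys: list, k: int):
        self.m, self.keys, self.k = m, keys, k
        self.bits = bytearray(m)

    def _positions(self, key_index: int, value: str):
        # Family i is simulated by salting the hash with the key index.
        for j in range(self.k):
            digest = hashlib.sha256(f"{key_index}:{j}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, e: dict) -> None:
        for i, key in enumerate(self.keys):
            for pos in self._positions(i, e[key]):
                self.bits[pos] = 1

    def query(self, e_q: dict) -> bool:
        # Only the keys present in e_q are checked, so partial-key queries are supported.
        return all(self.bits[pos]
                   for i, key in enumerate(self.keys) if key in e_q
                   for pos in self._positions(i, e_q[key]))
```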

3.3.1 Interval-Based Optimization

MBF uses the same number of hash functions for all keys, without considering the data distribution of each key. In fact, the data distributions of the keys are not identical; e.g., some keys are skewed while others are uniform. Considering this, we propose an Interval-based Multi-key Bloom Filter (IMBF). IMBF divides the keys into specific intervals according to their data distributions, and different intervals use different numbers of hash functions.

More specifically, suppose that each element has c keys. For the i-th key, the probability that a given bit is still not set to 1 after k hash insertions is as follows:

$$\begin{aligned} \left(1-\frac{1}{m}\right)^{k} \end{aligned}$$

After inserting n elements, the probability that a given bit has been set to 1 is as follows:

$$\begin{aligned} 1-\left( 1-\frac{1}{m}\right) ^{\sum _{j=1}^{c}n_{j}k_{j}} \end{aligned}$$

where \(n_j\) is the number of distinct elements in the j-th key. After inserting n multi-key elements, the total number of hashes is \({\sum _{j=1}^{c}n_{j}k_{j}}\). Therefore, for the i-th key, the expected value of its FPR is:

$$\begin{aligned} E(\text {FPR}_i)=\left( 1-\left( 1-\frac{1}{m}\right) ^{\sum _{j=1}^{c}n_{j}k_{j}}\right) ^{k_i} \end{aligned}$$
(13)

We assume that the data distributions of all keys are independent of each other, which is reasonable in practice. We can then obtain the FPR over all keys as follows, where \(P_j=\frac{n_j}{n}\) and n is the total number of multi-key elements. For simplicity, we assume that the total number of hash functions over all keys is a constant \(K_\text {sum}\), i.e., \(K_\text {sum}=\sum _{i=1}^{c}k_i\).

$$\begin{aligned} \begin{aligned} E(\text {FPR})&=\prod _{i=1}^{c}E(\text {FPR}_i) =\left( 1-\left( 1-\frac{1}{m}\right) ^{\sum _{j=1}^{c}n_{j}k_{j}}\right) ^{\sum _{i=1}^{c}k_i}\\&\approx \left( 1-e^{-\frac{1}{m}\sum _{j=1}^{c}n_{j}k_{j}}\right) ^{K_\text {sum}} \approx \left( 1-e^{-\frac{n}{m}\sum _{j=1}^{c}P_{j}k_{j}}\right) ^{K_\text {sum}} \end{aligned} \end{aligned}$$
(14)

At this point, we can transform the problem of minimizing \(E(\text {FPR})\) into the problem of minimizing \(\sum _{j=1}^{c}n_{j}k_{j}\). By the AM-GM inequality, when \(n_jk_j \ge 0\), we have \(\sum _{j=1}^{c}n_{j}k_{j}\ge c\root c \of {\prod _{j=1}^{c}n_{j}k_{j}}\), with equality if and only if all the terms \(n_jk_j\) are equal. Therefore, to achieve the optimization goal \(\min \{\sum _{j=1}^{c}n_{j}k_{j}\}\), we should set a smaller \(k_j\) for a larger \(n_j\), and a larger \(k_j\) for a smaller \(n_j\).

Empirically, for a small amount of data, we can directly calculate the size of \(n_j\). If the amount of data is huge, we can use sampling or some cardinality estimation methods such as HyperLogLog Counting [16] and Adaptive Counting [17]. In the experiments of this paper, we use the method of sampling estimation.

Based on this idea, we can further simplify our model. Let P denote the ratio of the number of unique elements to the total number of elements for a key. We divide P into intervals and use different numbers of hash functions for different intervals. For simplicity, we introduce an interval division parameter I. When \(I=4\), we divide the range of P into four parts, namely \(I_1\)=[0, 25%), \(I_2\)=[25%, 50%), \(I_3\)=[50%, 75%) and \(I_4\)=[75%, 100%]. Following the derivation above, \(I_4\) uses fewer hash functions and \(I_1\) uses more, with the number of hash functions of adjacent intervals differing by 1. Combining this with the learned classifier (i.e., VIMC in Sect. 3.2), we obtain an improved MLBF called the Interval-based Multi-key Learned Bloom Filter (IMLBF).
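A minimal sketch of this allocation rule, assuming I equal-width intervals of P and a hypothetical maximum hash count k_max; adjacent intervals differ by one hash function as described above.

```python
def assign_hash_counts(distinct_ratios, I=4, k_max=8):
    """Interval-based allocation: keys whose distinct-value ratio P = n_j / n
    falls in a higher interval get fewer hash functions."""
    ks = []
    for P in distinct_ratios:
        interval = min(int(P * I), I - 1)   # which of the I equal-width intervals P falls in
        ks.append(k_max - interval)         # higher interval -> fewer hash functions
    return ks

# With I = 4 and k_max = 8: P = 0.9 -> 5 hashes, P = 0.45 -> 7, P = 0.1 -> 8.
print(assign_hash_counts([0.9, 0.45, 0.1]))  # [5, 7, 8]
```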

4 Experimental Evaluation

We evaluate the performance of the multi-key classifier and the multi-key Bloom filter on real data. All experiments are run on an Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz with 32 GB RAM.

4.1 Datasets

We use three datasets, IMDBmovies, CriteoCTR, and MaliciousURLs, to simulate multi-key inserts and queries. The first two datasets are employed to simulate partial-key query scenarios, given the significant amount of missing data in both sets. The URL dataset, encompassing all available data, is employed to evaluate the performance of MLBF under high query load and to exemplify its application in malicious website identification.

4.1.1 IMDBmovies

IMDBmovies comes from IMDB, an online database of movies, television programs, etc. IMDBmovies has 80,000 movie reviews, each with information such as movie length, average rating, and the number of directors/actors. Each data element contains string, number, and other types of fields, which suits the multi-key membership testing scenario. We remove redundant elements and the columns with massive missing values. We treat the original data as positive samples. Since the dataset does not contain explicit negative data, we randomly select a value from each column to form a new element as a negative sample.

4.1.2 Criteo CTR

Criteo CTR is an online advertising dataset released by Criteo Labs. It contains feature values and click feedback of display ads, and it can serve as a benchmark for click-through rate (CTR) prediction. Each advertisement is described by a number of feature fields. To simulate multi-key membership testing, the advertisements with a label of 0 are regarded as negative samples, and the others are regarded as positive samples.

4.1.3 MaliciousURLs

MaliciousURLs, published on Kaggle, consists of a total of 651,191 URLs, including 428,103 benign URLs, 96,457 defacement URLs, 94,111 phishing URLs, and 32,520 malware URLs. We classify all the malicious URLs as positive samples and the remaining benign URLs as negative samples. Moreover, to assess the performance of our method in multi-key scenarios, we divide each URL into several keys, such as hostname and port.

4.2 Experiment Results

4.2.1 Performance of VIMC

In this set of experiments, we evaluate the performance of the multi-key classifier. Firstly, we compare our method with the baseline methods on the real-world datasets. Secondly, we verify the effectiveness of each component of the proposed model through ablation experiments. Then, we show the FPR of VIMC at different thresholds to demonstrate the significant effect of the proposed model on reducing FPR. Finally, we analyze the hyperparameter settings of \(\gamma\) for different loss functions to prove that our loss function can further reduce FPR.

Fig. 3: Model accuracy comparison

Accuracy of VIMC In order to verify the effectiveness of our proposed Value-Interaction-based Multi-key Classifier (VIMC) model, we introduce five mainstream models as baselines, including DeepFM [10], DNN [18], LSTM [19], Linear [20] and Random Forest (RF) [21].

We use accuracy to evaluate the performance of the six methods, which is a commonly-used metric to measure the performance of deep learning models and can be calculated as follows:

$$\begin{aligned} \text {Accuracy }= (\text {TP} + \text {TN}) / (P + N) \end{aligned}$$
(15)

where \(\text {TP}\) is the number of true positives, \(\text {TN}\) is the number of true negatives, P is the number of positives, and N is the number of negatives.

We report the accuracy of the methods in Fig. 3. On all three datasets, our proposed VIMC performs best: it outperforms the best baseline (DeepFM) by \(2.95\%\), \(0.97\%\) and \(2.24\%\) on IMDBmovies, Criteo CTR and MaliciousURLs, respectively. Since the first two datasets contain a large portion of missing attributes, this provides good evidence of the effectiveness of VIMC in reducing the FPR of MLBF and IMLBF for partial-key queries.

Table 1 Ablation study results based on VIMC

Ablation Study In this experiment, we conduct a detailed ablation study by comparing the accuracy (ACC) and binary cross-entropy loss (BCE) of different variants of the proposed model on the three datasets.

We first introduce the three variants, each with an optimization component removed. VIMC-DeepFM is a variant of VIMC with the DeepFM encoder removed. VIMC-pskd removes progressive self-knowledge distillation from VIMC, and VIMC-DeepFM-pskd removes both components.

As expected, all three variants perform worse than VIMC. As shown in Table 1, no matter which part is removed, the accuracy decreases and the loss increases. Specifically, after removing both the DeepFM encoder and progressive self-knowledge distillation from VIMC, the accuracy drops by about 1.6%, 1.0% and 1.5%, while the binary cross-entropy loss increases by about 0.04, 0.02 and 0.02 on the IMDBmovies, Criteo CTR and MaliciousURLs datasets, respectively. This verifies that both components of our proposed model help to learn value interactions from the data and enhance the performance of the multi-key classifier.

FPR of VIMC We now evaluate accuracy (ACC) and false positive rate (FPR) of VIMC at different thresholds. Results are reported in Table 2. Through this experiment, we can effectively evaluate the ability of VIMC to reduce false positives. The threshold varies in the range of [0.5, 0.9]. It can be seen from the table that as the threshold increases, the accuracy of VIMC decreases, but the FPR of VIMC also decreases. A higher threshold can effectively reduce the false positives produced by VIMC, which is very important for LBF. At the same time, we can also see that even if the threshold is set to 0.9, the attenuation of model capability is still acceptable.

Table 2 FPR of VIMC at different thresholds
Fig. 4: Effect of \(\gamma\)

Effect of \(\gamma\). \(\gamma\) is the hyperparameter of the loss function in the weighted methods. We compare the FPR reduction of focal loss and adaptive loss at different \(\gamma\). In this experiment, we train the model at different \(\gamma\) (0, 0.2, 0.5, 1.0, 1.5, 2.0, 5.0) while keeping the other parameters unchanged. In addition, VIMC denotes the model without any weighted loss function, so its results do not change as \(\gamma\) increases. As illustrated in Fig. 4, both focal loss and adaptive loss show decreasing FPR with increasing \(\gamma\), but the FPR of our proposed method drops significantly faster than that of focal loss on all datasets. In particular, when \(\gamma =5\), the FPR values of our method are \(4.95\%\), \(11.71\%\), and \(1.23\%\) lower than focal loss on IMDBmovies, Criteo CTR, and MaliciousURLs, respectively. Because our method adaptively weights negative examples across different prediction probability intervals, it effectively reduces the FPR. In other words, VIMC with the adaptive negative weighted method can serve as a multi-key classifier with an extremely low FPR, which is well suited to Bloom filters.

Table 3 Sensitivity of OOD detection

OOD Detection Sensitivity To evaluate the sensitivity of the OOD detection module, we randomly select 20% of the dataset as test data, which mimics new batches of data that may be encountered in the future. Within this test data, we randomly select a subset of samples whose proportion relative to the entire dataset equals a ratio \(\beta\), and modify them to deviate from the original distribution. The performance of the OOD detection component is evaluated on the same trained model, and the experimental results are reported in Table 3. If \(d-2\times std > 0\), it is deemed that an out-of-distribution (OOD) situation has occurred.

The relative accuracy here reflects the change in model accuracy before and after inserting out-of-distribution (OOD) data. Experimental results on three datasets show that the model’s accuracy significantly degrades when encountering OOD data. Nevertheless, the OOD detection method we adopted is effective in identifying this situation and does not generate false positives for minor data perturbations.

4.2.2 Performance of Multi-key Learned Bloom Filter

In this set of experiments, we evaluate the performance of the multi-key Bloom filter. Two main metrics are compared for the methods below: the False Positive Rate (FPR) and the CPU time per query. We perform all negative queries and report the average CPU time per query. We study the following data structures.

(1) SMBF: the Standard Multi-key Bloom Filter, which uses only a single Bloom filter with the same structure and settings as the Bloom filter in our MLBF.

(2) IMBF: our Interval-based Multi-key Bloom Filter, which applies the interval-based optimization method on top of SMBF.

(3) Ada-BF: a learned filter that adjusts the FPR adaptively by tuning the number of hash functions in different regions [8]. The classifier used in this study is a random forest. To enable multi-key membership testing for Ada-BF, the keys of each query in the dataset are concatenated.

(4) MLBF: our Multi-key Learned Bloom Filter. This variant of SMBF includes the predictor part, i.e., VIMC.

(5) IMLBF: our Interval-based Multi-key Learned Bloom Filter, which applies the interval-based optimization method on top of MLBF.

In the following, we first study the effect of I, a parameter in the interval-based optimization method denoting the number of intervals, to validate the sensitivity of both metrics to I. Then, we evaluate the performance of our proposed data structures after using two optimization methods. All the above experiments are performed on three datasets.

Fig. 5: Effect of I on FPR

Fig. 6: Effect of I on CPU time

Effect of I. In IMBF, we obtain a lower FPR by using different numbers of hash functions for different keys, and we use the hyperparameter I to adjust these numbers. To facilitate the experiment, we compare only MBF and IMBF in this section, which are the Bloom filter parts of MLBF and IMLBF, respectively. As shown in Figs. 5 and 6, we compare the FPR and CPU time of MBF and IMBF with different I while varying the bitmap size from 150 Kb to 240 Kb. As the bitmap size increases, all methods show a decreasing trend in FPR, because a larger bitmap implies a smaller hash collision probability. At the same time, the CPU time of all methods decreases with a larger bitmap, since a larger bitmap means the probability of a bit being 0 is greater; when querying, if any mapped bit is 0, the methods immediately return that the element does not exist, so a larger bitmap leads to a higher probability of returning a result in less time. Also, for a fixed bitmap size, a larger hyperparameter I corresponds to a smaller FPR, which is consistent with the results derived from Eq. 14.

Fig. 7: Comparison of different methods on FPR

Fig. 8: Comparison of different methods on CPU time

Comparison of Different Methods. We proceed to compare the FPR and CPU time of four methods at different bitmap sizes, namely SMBF, Ada-BF, MLBF and IMLBF. The CPU time for all LBF-related data structures includes the time spent on model predictions. This experiment also strictly follows the parameter settings stated at the beginning of Sect. 4.2.2. As shown in Figs. 7 and 8, with our proposed multi-key classifier (i.e., VIMC), the FPR of all learned BFs is reduced remarkably, at the price of a minor extra CPU cost. We also observe that IMLBF always performs best in terms of FPR, regardless of the bitmap size, which shows the superiority of our optimization strategies. Moreover, as the bitmap size grows, the CPU time decreases for all methods, because larger bitmaps are more likely to return results earlier in a query. The FPR of all methods also shows a decreasing trend as the bitmap size increases, for the same reason as in the previous experiment: a larger bitmap means the hash functions are more likely to be collision-free, so false positives occur with smaller probability. Experiments on the MaliciousURLs dataset demonstrate that the proposed method maintains a low FPR and achieves a running time of approximately 0.1 ms when dealing with large-scale datasets and high query loads. This suggests that the method is feasible for online membership testing scenarios, e.g., the identification of malicious websites.

5 Related Work

5.1 Bloom Filter

Bloom Filter (BF) [1] was designed by Bloom in 1970 and is widely used for membership testing across various domains. For example, some website servers use BF to block malicious IP addresses [22]. Distributed databases, such as Google Bigtable [23] and Apache Cassandra, use BF to avoid unnecessary disk accesses, optimizing their efficiency. Even Bitcoin [24] uses BF to determine whether a wallet is synchronized successfully. To meet different requirements (e.g., high lookup performance and low memory consumption), various BF variants have been proposed. For example, Compressed BF [6] uses arithmetic encoding to further compress the space of BF.

5.2 Learned Bloom Filter

Kraska et al. [7] improve the traditional BF by adding a learned classifier before the BF: the classifier first learns the data distribution, and when a new queried element arrives it determines whether the element exists in the given element set. For elements the classifier judges to be non-existent, an additional BF is used to make the final determination. This improved BF is called the Learned Bloom Filter (LBF). A large number of studies [8, 25, 26] have shown that LBF improves on the traditional BF, especially in reducing FPR and memory consumption.

6 Conclusion

We propose and offer solutions to a novel multi-key membership testing problem. To achieve a low False Positive Rate (FPR) and low memory consumption, we present a Multi-key Learned Bloom Filter (MLBF) data structure that combines a value-interaction-based multi-key classifier with a tailor-made multi-key Bloom filter. Further, an improved MLBF data structure, i.e., the Interval-based MLBF, is proposed to improve multi-key membership testing performance. To the best of our knowledge, this is the first study that considers multi-key membership testing and multi-key learned Bloom filters. An extensive empirical study with real data offers evidence that the paper's proposals significantly reduce the FPR of membership queries while offering acceptable query efficiency.