Selectivity Estimation on Set Containment Search

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of the set containment search of query Q over S. We first extend existing distinct value estimating techniques to solve this problem and develop an inverted list and G-KMV sketch-based approach, IL-GKMV. We show that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we resort to developing effective and efficient sampling approaches and propose an ordered trie structure-based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than simple random sampling and IL-GKMV. To further enhance the performance, a divide-and-conquer-based sampling approach, DC-Sampling, is presented with an inclusion/exclusion prefix to explore pruning opportunities. Meanwhile, we consider weighted set containment selectivity estimation and devise a stratified random sampling approach named StrRS. We theoretically analyze the proposed techniques regarding various accuracy estimators. Our comprehensive experiments on nine real datasets verify the effectiveness and efficiency of the proposed techniques.


Introduction
Set-valued attributes are ubiquitous and play an important role in modeling database systems in many applications such as information retrieval, data cleaning, machine learning and user recommendation. For instance, such set-valued attributes may correspond to the profile of a person, the tags of a post, the domain information of a webpage and the tokens or q-grams of a document. In the literature, there has been a variety of interest in the computation of set-valued records, including set containment search (e.g., [9,21,28,36]), set similarity joins (e.g., [31,33]) and set containment joins (e.g., [13,24,25,34]).
In this paper, we focus on the problem of selectivity estimation of set containment search. Considering a query record Q and a collection of records S, where a record consists of an identifier and a set of elements (i.e., terms), a set containment search retrieves the records from S which are contained by Q, i.e., {X | X ∈ S ∧ Q ⊇ X}, where Q contains X (Q ⊇ X) if all the elements in X are also in Q. Table 1 shows an example with eight records in a dataset and a query record Q, where Q contains X_2, X_3 and X_5. The selectivity (cardinality) of a query refers to the size of the query result; for instance, the selectivity of Q in Table 1 is 3.
Selectivity estimation on set containment search aims at estimating the cardinality of the containment search. As an essential and fundamental tool over massive collections of set values, the problem has a wide spectrum of applications because it can provide users with fast and useful feedback. As a simple example, when introducing a new product to the market, its characteristics and features could be described as a set of keywords. Assume a preference dataset consists of the characteristics and features desired by users, collected from an online survey. Estimating the selectivity of the new product's description over the preference dataset yields the total number of users who may be interested in the product and could serve as a prediction of the product's market potential. As another example, companies may post positions on an online job market Web site, where a position description contains a set of required skills. A job seeker may want to gain a basic understanding of the job market by obtaining the total number of active job vacancies that he/she perfectly matches (i.e., the skill set of the job seeker contains the required skills of the job).

Challenges
The key challenges of selectivity estimation on set containment search come from the following three aspects. First, the dimensionality (i.e., the number of distinct elements) is high. Shingle (n-gram)-based representations of strings are common in practice [26]. A typical (first-order) shingle-based representation of a string or sentence is the collection of its words, each of which is separated by a space; high-order shingles represent strings with different combinations of words. As shown in our empirical studies, the vocabulary size of a real-world dataset can exceed 3 million when high-order shingles are used. This makes selectivity estimation techniques which are sensitive to dimensionality inapplicable to our problem. Second, the number of records in the dataset could be very large. Moreover, the lengths of the query and the data records may also be large. To deal with the sheer volume of the data, it is desirable to efficiently and effectively provide approximate solutions. Third, the distribution of element frequency may be highly skewed in real applications. It is desirable to devise sophisticated data-dependent techniques that properly handle the skewness of the data distribution to boost accuracy.
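To make the vocabulary growth concrete, the following minimal sketch (a hypothetical helper, assuming simple whitespace tokenization) derives first- and higher-order shingle representations of a string; every distinct shingle contributes one dimension to the element universe E, so higher orders inflate the vocabulary quickly.

```python
def shingles(text, order=1):
    """Return the set of order-n shingles (word n-grams) of a string.

    First-order shingles are the individual words; higher orders are
    contiguous word combinations, which rapidly enlarge the vocabulary.
    """
    words = text.split()
    return {" ".join(words[i:i + order])
            for i in range(len(words) - order + 1)}

record = shingles("the quick brown fox")             # {'the', 'quick', 'brown', 'fox'}
bigrams = shingles("the quick brown fox", order=2)   # {'the quick', 'quick brown', 'brown fox'}
```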
Even though selectivity estimation has been widely explored, most existing techniques cannot be trivially applied to the problem studied in this paper. We discuss two categories of techniques which can be extended to support the selectivity estimation problem: range counting estimation (e.g., [8,15]) and distinct value estimation (e.g., [12,16]).
Given the element universe (vocabulary) E, a record X_i can be regarded as an |E|-dimensional binary vector, where X_ij = 1 if element e_j appears in X_i (e_j ∈ X_i) and X_ij = 0 otherwise, for 1 ≤ j ≤ |E|. Let n denote the vocabulary size |E|. Under this context, the dataset S can be modeled as a set of points in {0, 1}^n, where each record corresponds to an n-dimensional point and the query is a hypercube in {0, 1}^n. Thus, we can rewrite the selectivity estimation problem as the approximate range counting problem in computational geometry. However, approximate range counting suffers from the curse of dimensionality: the computing cost depends exponentially on the dimensionality n [16,27]. As the vocabulary size is usually large, range counting estimation methods are impractical for our problem.
Distinct value estimators (e.g., KMV [12], bottom-k, min-hash [16]) can effectively support size estimation for set operations (e.g., union and intersection) and are widely used for size estimation problems in different contexts. In Sect. 3.2, we show how to extend the distinct value-based estimator to the problem studied in this paper by combining it with inverted list techniques. We also show that the performance of the distinct value estimator-based approach degrades when the vocabulary size is large, due to the inherent superset containment semantics of the problem studied in this paper. Wang et al. [32] study selectivity estimation on streaming spatio-textual data where the textual data are a set of keywords/terms (i.e., elements). However, the query semantics is different, as it specifies a subset containment search on the textual data, i.e., the keywords (elements) in the query should be contained by the keywords of the spatial objects. This differs from the superset query semantics in our problem, which is more challenging to handle using distinct value estimators, as discussed in Sect. 3.2.

Contributions
Motivated by the challenges and limitations of existing techniques, in this paper we aim to develop efficient and effective sampling-based approaches to tackle the problem. Naively applying random sampling over the dataset ignores the element frequency distribution and results in compromised performance. Intuitively, combinations of high-frequency elements (i.e., frequent patterns) occur among data records with high frequency, and records with similar frequent patterns are more likely to be contained by the same query. Thus, we use the frequent patterns as labels and partition records by these labels to boost efficiency and accuracy. Moreover, assuming that the elements are ordered by frequency, we use an ordered trie structure to maintain partitions of the dataset and present the OT-Sampling method. This ordered trie-based approach, though demonstrated to be highly efficient and accurate, does not consider the element distribution of the query Q. Inspired by the observation that a query Q must include every element of a record X in order to contain X, efficient pruning techniques are developed on the partitions of the dataset. We further propose a divide-and-conquer-based sampling approach named DC-Sampling, which only conducts sampling on the qualified partitions surviving the pruning. The principal contributions of this paper are summarized as follows.
• This is the first work to systematically study the problem of selectivity estimation on set containment search, which is an essential tool for set-valued attribute analysis in a wide range of applications.
• Two baseline algorithms are devised. The first is based on random sampling. We also extend the distinct value estimator G-KMV sketch and propose an inverted list-based approach, IL-GKMV. Insights about the limitations of the two baseline approaches are theoretically analyzed and empirically studied.
• We develop two novel sampling-based techniques: OT-Sampling and DC-Sampling. OT-Sampling integrates an ordered trie index structure to group the dataset and achieves higher accuracy by capturing element frequencies and frequent patterns. DC-Sampling employs a divide-and-conquer philosophy and an inclusion/exclusion-set prefix to further improve the performance by exploring pruning opportunities and skipping sampling on pruned partitions of the dataset.
• We consider the selectivity estimation problem with respect to weighted set containment search, which is a generalization of the simple set containment search problem. A naive random sampling method and a stratified sampling approach are proposed to tackle this problem.
• Comprehensive experiments on a variety of real-life datasets demonstrate the superior performance of the proposed techniques compared with the baseline algorithms.

Preliminary
In this section, we first formally present the problem of containment selectivity estimation and then give some preliminary knowledge. The notations used throughout this paper are summarized in Table 2.

Problem Definition
Suppose the element universe is E = {e_1, e_2, …, e_n}. Each record X consists of a set of elements from the domain E. Let S be a collection of records {X_1, X_2, …, X_m}. Given two records X and Y, we say X contains Y, denoted as X ⊇ Y, if all elements of Y can be found in X; we also say X is a superset of Y, or Y is a subset of X. Given a query record Q and a dataset S, a set containment search of Q over S returns all records from S which are contained by Q, i.e., {X | X ∈ S, Q ⊇ X}. We use t to denote the selectivity (cardinality) of the set containment search; it measures the number of records returned by the search, namely t = |{X | X ∈ S, Q ⊇ X}|.
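Since containment is verified by merge join in our implementation (Sect. 6), the predicate Q ⊇ X can be sketched as follows. This is a minimal illustration assuming records are stored as sorted lists of element identifiers, not the exact implementation.

```python
def contains(q, x):
    """Merge-join test of Q ⊇ X; q and x are sorted lists of element ids."""
    i = j = 0
    while i < len(q) and j < len(x):
        if q[i] == x[j]:      # element of X matched in Q
            i += 1
            j += 1
        elif q[i] < x[j]:     # skip elements of Q that X does not need
            i += 1
        else:                 # x[j] cannot appear in Q anymore
            return False
    return j == len(x)        # every element of X was matched

def selectivity(q, dataset):
    """Exact containment selectivity t = |{X in S : Q ⊇ X}|."""
    return sum(contains(q, x) for x in dataset)
```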
Considering the containment relationship between a given query Q and a record X_i ∈ S (1 ≤ i ≤ m), let δ_i be the indicator function such that δ_i = 1 if Q ⊇ X_i and δ_i = 0 otherwise; (1) then the selectivity of the set containment search on dataset S with respect to the query Q can also be calculated as t = ∑_{X_i ∈ S} δ_i.

Problem Statement
In this paper, we investigate the problem of selectivity estimation on set containment search. Given a query record Q and a dataset S , we aim to accurately and efficiently estimate the selectivity of the set containment search of Q on S.
Hereafter, whenever there is no ambiguity, selectivity estimation on set containment search is abbreviated to containment selectivity estimation.

Weighted Set Containment Search
Weighted records are common in the real world. For example, the product reviews on Amazon can be modeled as a weighted dataset, where each user corresponds to one record and every entry in the record is the rating score of one product. The formal definition of weighted records is as follows.
Definition 1 (Weighted records) Given the element universe E = {e_1, e_2, …, e_n}, an n-dimensional weighted record X_i is a set of n weights w_ij (1 ≤ j ≤ n), where w_ij is the weight of X_i with respect to e_j and w_ij belongs to the domain D_j ⊆ ℝ (ℝ is the one-dimensional real space).
When setting D_j = {0, 1}, the simple (unweighted) records are obtained. Next, we define the inclusion relationship between weighted records.
Definition 2 (Weighted records inclusion) Given two weighted records X = {w_1, w_2, …, w_n} and Y = {v_1, v_2, …, v_n}, X is included by Y, denoted as X ⊆_w Y, if w_j ≤ v_j for every 1 ≤ j ≤ n.
Given the definition of weighted records inclusion, we formally present the set containment search problem on weighted records.
Definition 3 (Weighted set containment search) Given a weighted query record Q = {q_1, q_2, …, q_n}, the weighted set containment search retrieves all records X_i from the weighted dataset S that satisfy X_i ⊆_w Q.
Our goal is to estimate the size of the query result given a query Q, i.e., the selectivity of weighted set containment search. Obviously, when the domains of weight are set as {0, 1} , the weighted set containment search problem degenerates to the simple set containment search problem.
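A minimal sketch of the weighted inclusion test under the componentwise reading of Definition 2, assuming dense weight vectors over the element universe:

```python
def weighted_contains(q, x):
    """Test X ⊆_w Q: every weight of X is at most the matching weight of Q.

    q and x are equal-length weight vectors over the element universe;
    restricting all weights to {0, 1} recovers plain set containment.
    """
    return all(wx <= wq for wq, wx in zip(q, x))
```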

Estimation Measure
In order to evaluate the accuracy of containment selectivity estimation, we apply the mean square error (MSE) to measure the expected difference between an estimator and the true value: MSE(t̂) = E[(t̂ − t)²], where t̂ is an estimator for t. If t̂ is an unbiased estimator, the MSE is simply the variance Var[t̂].

KMV Synopses
The k minimum values (KMV) technique, first introduced in [11], estimates the number of distinct elements in a large dataset. Given a no-collision hash function h which maps elements to the range [0, 1], the KMV synopsis of a record (set) X, denoted by L_X, keeps the k minimum hash values of X. The number of distinct elements |X| can then be estimated by |X̂| = (k − 1)/U_(k), where U_(k) is the kth smallest hash value. Beyer et al. [12] also methodically analyze the problem of distinct element estimation under set operations. As for the union operation, consider two records X and Y with corresponding KMV synopses L_X and L_Y of sizes k_X and k_Y, respectively. Let L be the set of the k (= min(k_X, k_Y)) smallest hash values in L_X ∪ L_Y; since L is a KMV synopsis of X ∪ Y, an unbiased estimator for the number of distinct elements in X ∪ Y, denoted by D_∪ = |X ∪ Y|, is D̂_∪ = (k − 1)/U_(k), as shown in [12].
The variance of D̂_∪, as shown in [12], is Var[D̂_∪] = D_∪(D_∪ − k + 1)/(k − 2). (3) As shown in [12], Eq. 3 can be modified to compound set operations (Eq. 4). An improved KMV sketch, named G-KMV, is proposed in [32] to estimate the multi-union size. G-KMV imposes a global threshold and ensures that all hash values smaller than the threshold are kept. Considering a union operation ⋃ X_i with the sketch L = L_{X_1} ∪ L_{X_2} ∪ … ∪ L_{X_n}, the sketch size k for the union is k = |L_{X_1} ∪ L_{X_2} ∪ … ∪ L_{X_n}|. The estimation variance of the G-KMV method is smaller than that of the simple KMV method under reasonable assumptions, as analyzed in [35].
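The following sketch illustrates the KMV machinery above; the MD5-based hash is a stand-in assumption for the no-collision hash function h.

```python
import hashlib
import heapq

def h(elem):
    """Hash an element to [0, 1); stands in for the no-collision hash h."""
    digest = hashlib.md5(str(elem).encode()).digest()
    return int.from_bytes(digest, "big") / 2.0 ** 128

def kmv_synopsis(record, k):
    """L_X: the k minimum hash values of record X, in increasing order."""
    return sorted(heapq.nsmallest(k, (h(e) for e in record)))

def distinct_estimate(synopsis):
    """Basic KMV estimator (k - 1) / U_(k)."""
    k = len(synopsis)
    return (k - 1) / synopsis[-1]

def union_estimate(syn_x, syn_y, k):
    """The k smallest values of L_X ∪ L_Y (k ≤ min(k_X, k_Y)) form a KMV
    synopsis of X ∪ Y, so (k - 1)/U_(k) estimates D_∪ = |X ∪ Y|."""
    merged = sorted(set(syn_x) | set(syn_y))[:k]
    return (len(merged) - 1) / merged[-1]
```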

Baseline Solutions
In this section, we introduce two baseline solutions following simple random sampling and G-KMV sketching techniques, respectively.

Random Sampling Approach
A simple way to tackle the set containment estimation problem is to adopt the random sampling techniques and conduct set containment search over a sampled dataset S ′ which is usually much smaller compared with the original dataset S . After getting the selectivity of Q on sampled dataset S ′ , we scale it up to get an estimation of containment selectivity regarding S.
Given a sampling size budget b in terms of the number of records, the random sampling-based approach proceeds in two steps: (1) uniformly at random sample b records from the dataset S; (2) compare each sampled record X_i with the query Q and assign δ_i accordingly. Recall that δ_i is the containment indicator for a record X_i as shown in Eq. 1. Based on this, the containment selectivity estimator (t̂_R) of the random sampling approach is t̂_R = (m/b) ∑_{i=1}^{b} δ_i. (5) Note that δ_i is a binary random variable because of the random sampling on records. Next, we show that the estimator t̂_R is unbiased and then derive its variance. We first compute the probability of the event {δ_i = 1}. Let t denote the containment selectivity over dataset S with respect to query Q, i.e., t = |{X | X ∈ S, Q ⊇ X}|; then Pr[δ_i = 1] = t/m, where m is the total number of records, and thus the expectation of δ_i is E[δ_i] = t/m. By the linearity of expectation, the expectation of the estimator in Eq. 5 is E[t̂_R] = t, and the variance is Var[t̂_R] = (m²/b)(t/m)(1 − t/m) = t(m − t)/b. (6)
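A minimal sketch of the two-step estimator of Eq. 5, reusing the contains helper from Sect. 2 (sampling without replacement for simplicity):

```python
import random

def rs_estimate(q, dataset, b):
    """Random sampling estimator of Eq. 5: (m / b) * Σ δ_i."""
    m = len(dataset)
    sample = random.sample(dataset, b)           # step (1): draw b records
    hits = sum(contains(q, x) for x in sample)   # step (2): assign δ_i
    return m * hits / b                          # scale up to the dataset
```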

IL-GKMV: Inverted List and G-KMV Sketch-Based Approach
The random sampling method, which is very efficient, may result in poor accuracy because it ignores the data distribution information, e.g., the distribution of element frequency or record length. In this section, we develop containment selectivity estimation techniques which are data dependent by utilizing the inverted list and G-KMV sketch techniques.
In the first step, we build an inverted index I on the dataset S, where an element (token) e_i is associated with the list of identifiers of the records that contain e_i [10]. For instance, in Table 1, the inverted list of element e_3 is {X_1, X_2, X_5}. Let f_i denote the frequency of an element e_i, i.e., the size of the inverted list I_{e_i}, and let Pr[e_i = 1] denote the probability that a record in the dataset contains the element e_i; then we have Pr[e_i = 1] = f_i/m. Similarly, given a record X = {e_1, e_2, …, e_|X|}, the probability of X appearing in the dataset is the joint probability Pr[X = 1] = Pr[e_1 = 1, e_2 = 1, …, e_|X| = 1]. Note that a record X can be duplicated in the dataset S; given a query Q, the containment selectivity t of Q is calculated as t = m · ∑_{X ⊆ Q} Pr[X = 1], (7) where the sum is over all subsets of Q. The above equation enumerates every subset of the query Q to check whether it appears in the dataset. In order to compute Eq. 7, we need to compute the joint probability Pr[X = 1] for each subset X of Q. Clearly, the complexity of Eq. 7 is exponential in the query size |Q|, which is not acceptable when |Q| is large. Furthermore, the joint probability computation of Pr[X = 1] is complicated and expensive.
Given the difficulty of directly computing the containment selectivity, we consider the complement version of set containment search: a record X is contained by Q if and only if X contains no element from E∖Q. Hence, the records not contained by Q are exactly those appearing in at least one inverted list I_e with e ∈ E∖Q, and the containment selectivity can be written as t = m − |⋃_{e ∈ E∖Q} I_e|. (9) The key point in the above equation is to calculate the union size of the inverted lists, which has a time complexity of ∑_{e ∈ E∖Q} |I_e| by merge join. Since the set E∖Q and the inverted lists I_e could both be very large, directly computing the multi-union operation could result in unaffordable time consumption. Based on this, we adopt approximate methods (e.g., the G-KMV sketch) to estimate the union size of the inverted lists.
For each element e ∈ E, L_e denotes the G-KMV synopsis of its inverted list with the k (= |L_e|) smallest hash values. Considering the union of inverted lists in Eq. 9, we have the sketch L = ⋃_{e ∈ E∖Q} L_e and k = |L| as introduced in Sect. 2.2; then the size D_∪ of the multi-union set ⋃_{e ∈ E∖Q} I_e can be estimated as D̂_∪ = (k − 1)/U_(k), where U_(k) is the kth smallest hash value in the synopsis L. Thus, the containment selectivity of the G-KMV sketch-based method is computed as t̂_G = m − D̂_∪. Furthermore, the variance Var[t̂_G] = Var[D̂_∪] can be calculated by Eq. 4.
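The complement-based estimate t̂_G = m − D̂_∪ can be sketched as below. This simplified rendering keeps a fixed-size synopsis per inverted list instead of a true global hash threshold, so it approximates rather than reproduces the G-KMV discipline.

```python
def il_gkmv_estimate(q, synopses, m, k):
    """IL-GKMV sketch: estimate t as m minus the estimated size of the
    union of inverted lists of all elements outside Q.

    synopses maps each element e to L_e, the sorted hash values kept for
    its inverted list I_e; q is the query as a set of elements.
    """
    pooled = set()
    for e, syn in synopses.items():
        if e not in q:                  # e ∈ E \ Q
            pooled.update(syn)
    union_syn = sorted(pooled)[:k]      # k smallest hash values overall
    d_union = (len(union_syn) - 1) / union_syn[-1]   # D̂_∪ = (k-1)/U_(k)
    return m - d_union                  # t̂_G = m − D̂_∪
```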

Analysis
Given the space budget b in terms of the number of records, the sketch size of the IL-GKMV method is |L| ≈ b · d̄, where d̄ denotes the average record length. By the G-KMV sketch, the budget is proportionally assigned to each inverted list. Apparently, with a very large vocabulary, the performance significantly deteriorates since each inverted list receives little sampling space. Remark that the time complexity of the simple random sampling method is O(b · C), where C is the time cost of a set comparison, while the time cost of IL-GKMV grows with the number of inverted lists involved, i.e., with the vocabulary size.

Our Approach
As analyzed in the previous section, the random sampling approach fails to capture the element frequency distribution. The IL-GKMV approach, on the other hand, considers the data distribution by utilizing the inverted list (i.e., frequent elements are associated with longer inverted lists) and G-KMV sketch (i.e., inverted lists with larger size keep more hash values) techniques. However, because of the inherent superset query semantics studied in this paper, the number of inverted lists involved in the IL-GKMV method depends linearly on the vocabulary size, which leads to compromised accuracy. In this section, we aim to develop sophisticated sampling approaches which strike a balance between accuracy and efficiency.

Trie Structure-Based Stratified Sampling Approach
Trie is a widely used tree data structure for storing a set of records (i.e., a dataset). Observing that combinations of high-frequency elements (i.e., frequent patterns) occur among records with high frequency, and that records with similar frequent patterns are more likely to be included by the same query, we adopt the trie structure to partition the dataset using combinations of high-frequency elements as labels. Assume that the elements of the vocabulary E are ordered by decreasing frequency in the underlying dataset. For example, the most frequent element in Table 1 is e_2, as it appears 5 times; e_7 appears 4 times and ranks second. Based on this ordering, we refer to the top-k high-frequency elements as E_k and adopt combinations of high-frequency elements within E_k as labels. The choice of k is discussed in Sect. 6.
Figure 1 illustrates an ordered trie T built on the dataset in Table 1 (Fig. 1: Trie structure). Each record in the trie is stored in a top-down manner starting from a null root node. It is interesting to notice that the left and upper part of the trie encompasses most of the dataset, since this part is made up of the high-frequency elements. Based on this, there is a natural partition strategy generated by the trie T: starting from the root node along the high-frequency part (the left and upper part of the trie), each path (a label for records) defines a partition of the dataset, since the records in the corresponding partition all share this path as a prefix. All the remaining records that do not share any high-frequency element are accumulated into a partition by themselves, whose label we set as ∅. Next, we give an example of the labels and the induced partitions.
Example 1 Consider the top-2 elements E_2 in Fig. 1; {e_2, e_7} is the label for records X_1, X_3, X_6; {e_2} is the label for records X_2, X_4; and {e_7} for X_5. The partitions induced by E_2 are therefore {X_1, X_3, X_6}, {X_2, X_4} and {X_5}, while the remaining records X_7 and X_8 form the ∅-labeled partition.
Next, we propose an approximate method to compute the containment selectivity based on the partition P = {P_1, …, P_|P|}. Given a query record Q and a sample size budget b (number of sampled records), we allocate the budget proportionally to the size m_i = |P_i| of each partition in P (i.e., stratified sampling); namely, for partition P_i, we draw m′_i = b · m_i/m records uniformly at random from P_i. Let P′_i denote these sampled records, i.e., P′_i = {X_i1, X_i2, …, X_im′_i}. In each partition, the query Q is compared with each sampled record X_ij; let δ_ij be the indicator such that δ_ij = 1 if X_ij ⊆ Q and δ_ij = 0 otherwise. Then an estimator of the containment selectivity is t̂_P = ∑_{i=1}^{|P|} (m_i/m′_i) ∑_{j=1}^{m′_i} δ_ij. (11) Algorithm 1 illustrates the ordered trie-based sampling approach (OT-Sampling). Line 1 collects the k most frequent elements E_k, and Line 2 constructs the ordered trie structure on the dataset S, followed by obtaining the labels according to E_k (Line 3). Lines 4–7 group the dataset based on the labels and conduct the set containment search of Q over each sample P′_i drawn from the individual partitions. Line 8 returns the final selectivity estimate.
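A sketch of the stratified estimator in Eq. 11, assuming the trie-based partitions are materialized as lists of records:

```python
def ot_sampling_estimate(q, partitions, b):
    """Stratified estimator of Eq. 11 over trie-based partitions.

    partitions: list of record lists P_1..P_|P|; the budget b is split
    proportionally to partition sizes and each stratum is scaled up
    separately.
    """
    m = sum(len(p) for p in partitions)
    estimate = 0.0
    for part in partitions:
        if not part:
            continue
        n_i = max(1, round(b * len(part) / m))       # proportional allocation
        sample = random.sample(part, min(n_i, len(part)))
        hits = sum(contains(q, x) for x in sample)   # δ_ij indicators
        estimate += len(part) * hits / len(sample)   # (m_i / m'_i) Σ δ_ij
    return estimate
```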

Analysis
Next, we show that the estimator t̂_P in Eq. 11 is unbiased, followed by an analysis of the variance Var[t̂_P]. Recall that the containment selectivity is t = |{X | X ⊆ Q and X ∈ S}|; for each partition P_i, let t_i be the number of subsets of Q in partition P_i, i.e., t_i = |{X | X ⊆ Q and X ∈ P_i}|, and t = ∑_i t_i. The probability that a sampled record X_ij in partition P_i is a subset of Q is Pr[δ_ij = 1] = t_i/m_i, so E[δ_ij] = t_i/m_i and Var[δ_ij] = (t_i/m_i)(1 − t_i/m_i). By linearity of expectation, the expectation of Eq. 11 is E[t̂_P] = ∑_i (m_i/m′_i) ∑_j E[δ_ij] = ∑_i t_i = t, which proves that t̂_P is an unbiased estimator of the containment selectivity. Similarly, the variance of t̂_P is Var[t̂_P] = ∑_i (m_i²/m′_i)(t_i/m_i)(1 − t_i/m_i). (12)

Compare with Random Sampling (RS) Approach
Comparing the variance of OT-Sampling in Eq. 12 with that of the RS approach in Eq. 6, we show that Var[t̂_P] ≤ Var[t̂_R] as follows. Let p_i denote the sampling probability in partition P_i; we have p_i = m′_i/m_i = b/m by the stratified sampling strategy. Suppose the number of partitions is q = |P|; then Var[t̂_P] = ∑_{i=1}^{q} (m_i²/m′_i)(t_i/m_i)(1 − t_i/m_i) = (m/b) ∑_{i=1}^{q} t_i(1 − t_i/m_i). By the Cauchy–Schwarz inequality, t² = (∑_{i=1}^{q} t_i)² ≤ m ∑_{i=1}^{q} t_i²/m_i, and hence ∑_{i=1}^{q} t_i(1 − t_i/m_i) ≤ t(1 − t/m), which gives Var[t̂_P] ≤ (m/b) · t(1 − t/m) = Var[t̂_R].

Time Complexity
The time complexity of the OT-Sampling method is O(b · C) + O(P), where C is the cost of a two-record containment check and O(P) is the preprocessing time for building the ordered trie and partitioning the records.

Divide-and-Conquer-Based Sampling Approach
In OT-Sampling, the sampling strategy is independent of query workload; that is, we do not distinguish the data information (e.g., labels) of each partition with respect to the query. In this section, we propose a query-oriented sampling approach to improve the estimation accuracy.
Consider the records X in a dataset as binary vectors with respect to the element universe E = {e_1, …, e_n}, i.e., each record is regarded as a size-n vector whose ith position is 1 if e_i ∈ X and 0 otherwise. Divide the element universe E into two disjoint parts E_1 and E_2; then each record X can be written in two parts X_1 and X_2, corresponding to E_1 and E_2, respectively, and we have X = {X_1; X_2}, where X_1 is concatenated with X_2. We give a lemma based on this division.

Lemma 1 (Subset inclusion) Given a query record Q and a record X from the dataset S, where Q and X are divided under the same strategy described above as Q = {Q_1; Q_2} and X = {X_1; X_2}, we have X ⊆ Q if and only if X_1 ⊆ Q_1 and X_2 ⊆ Q_2.
The proof of the lemma is straightforward. From this lemma, a simple pruning technique can be derived: if X_1 ⊈ Q_1, then X ⊈ Q.
Recall the trie-based partition method: we partition the dataset into several groups by the labels of the records, where a label can be regarded as the representative of its partition. Before drawing samples from a partition with label X_1, we can check whether X_1 is a subset of query Q; if not, we can skip sampling from that collection of records. In order to specify the grouping of records, we give the following definition.
Definition 4 ((E_1, E_2)-prefix collection) Given E_1 and E_2 as subsets of the element universe E, the (E_1, E_2)-prefix collection of records, denoted as S(E_1, E_2), consists of all records X from dataset S such that all elements of E_1 are contained in X while no element of E_2 appears in X, that is, S(E_1, E_2) = {X ∈ S | E_1 ⊆ X and E_2 ∩ X = ∅}.
Note that E 1 and E 2 are, respectively, named as inclusion element set and exclusion element set.
Example 3 The ({e_2}, {e_7})-prefix collection in Table 1 is {X_2, X_4}, i.e., the records that contain e_2 but not e_7 (cf. Example 1). Now, we can present the lemma which lays the foundation of the divide-and-conquer algorithm.
Lemma 2 Considering a prefix collection S(E_1, E_2) and an element e which does not belong to E_1 ∪ E_2, the containment selectivity of a given query Q within S(E_1, E_2) can be calculated as t_{S(E_1,E_2)}(Q) = t_{S(E_1∪{e}, E_2)}(Q) + t_{S(E_1, E_2∪{e})}(Q). The key point in the proof of Lemma 2 is to consider the conditional probability. We omit the detailed proof here due to space limitation.
Recall that in Sect. 3.2, we model a record X as a random variable and give the probability that X appears in dataset S. Similarly, we compute the generating probability of the prefix collection S(E_1, E_2) as the joint probability that a record contains every element of E_1 and no element of E_2: Pr[S(E_1, E_2)] = Pr[e = 1, ∀e ∈ E_1; e = 0, ∀e ∈ E_2]. (13) Next, we compute the number of subsets of a given query Q within the prefix collection S(E_1, E_2), i.e., the containment selectivity with regard to S(E_1, E_2). Let δ_X denote the indicator function such that δ_X = 1 if X ⊆ Q and δ_X = 0 otherwise; then the containment selectivity of Q with respect to S(E_1, E_2) is t_{S(E_1,E_2)} = ∑_{X ∈ S(E_1,E_2)} δ_X. Based on Lemma 2, we propose the divide-and-conquer algorithm illustrated in Algorithm 2. We can calculate the containment selectivity of Q within the dataset S by invoking the procedure with (S, ∅, ∅, Q); by Lemma 2, the current collection is partitioned into two groups of records by choosing an element e ∈ E, namely S(E_1 ∪ {e}, E_2) and S(E_1, E_2 ∪ {e}), and we compute the containment selectivity in each of the two groups recursively, as shown in Lines 4–5. When E_1 ⊈ Q, we can prune the collection S(E_1, E_2) by Lemma 1. Obviously, the time complexity of the exact divide-and-conquer algorithm is O(C · 2^n), where n is the size of the element universe E and C is the cost of a set comparison. Recall that the element frequency distribution is usually skewed in real datasets; we can therefore arrange the elements in decreasing frequency order when choosing the element e in Line 4 of Algorithm 2, which accelerates the computation by pruning more records corresponding to the high-frequency elements.
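A compact sketch of the exact recursion behind Algorithm 2; the list-based representation is an illustrative assumption rather than the actual implementation.

```python
def dc_exact(records, q, elements):
    """Exact count of {X in records : X ⊆ Q} via the recursion of Lemma 2.

    records: current prefix collection (records as sets); q: query as a set;
    elements: the not-yet-divided elements, ideally in decreasing frequency
    order, and covering the whole universe E.
    """
    if not records:
        return 0
    if not elements:          # fully divided: all survivors are subsets of Q
        return len(records)
    e, rest = elements[0], elements[1:]
    with_e = [x for x in records if e in x]         # S(E1 ∪ {e}, E2)
    without_e = [x for x in records if e not in x]  # S(E1, E2 ∪ {e})
    count = dc_exact(without_e, q, rest)
    if e in q:                # prune: if e ∉ Q, no record containing e fits
        count += dc_exact(with_e, q, rest)
    return count
```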

Approximate Divide-and-Conquer Algorithm
Next, we propose an approximate method based on the exact divide-and-conquer algorithm. In Algorithm 2, the dataset S is recursively partitioned into two collections of records by choosing an element e ∉ E_1 ∪ E_2, and we can order the elements by decreasing frequency to boost the computation efficiency. However, the complexity is still O(C · 2^n). In this section, we only consider the top-k high-frequency elements E_k, from which the element is selected to partition the dataset. After processing all the elements in E_k, we end up with 2^k prefix collections of records S_i(E_1, E_2), i = 1, 2, …, 2^k, which is much smaller than 2^n. Note that (E_1, E_2) can be regarded as the label of each prefix collection.
Recall from Lemma 1 that every record X can be written as a binary vector X = {X_1; X_2}, where X_1 corresponds to the top-k high-frequency element part E_k and X_2 is the remaining part concatenated to X_1. Similarly, when a query record Q arrives, let Q = {Q_1; Q_2} in the same manner; then, by Lemma 1, we can exclude all prefix collections S(E_1, E_2) with E_1 ⊈ Q_1. For the remaining prefix collections, we sample records from each group and conduct the containment search of Q over the sampled records. For a sampled record X = {X_1; X_2}, it is only required to test whether X_2 ⊆ Q_2, since X_1 ⊆ Q_1 already holds. In the following, we formally demonstrate how to estimate the containment selectivity of Q by the divide-and-conquer method.
Let γ_i denote the indicator function for prefix collection S_i(E_1, E_2) (S_i for short) such that γ_i = 1 if E_1 ⊆ Q_1 and γ_i = 0 otherwise. The size of the prefix collection S_i(E_1, E_2) can be computed as m_i = |S_i(E_1, E_2)| = m · Pr[S_i(E_1, E_2)] by Eq. 13. Let p_i be the sampling probability in S_i; then the sample size is m′_i = m_i · p_i. For any sampled record X_j = {X_1; X_2} in the prefix collection S_i, let δ_ij be the indicator with δ_ij = 1 if X_2 ⊆ Q_2 and δ_ij = 0 otherwise. Then an estimator for the containment selectivity of Q by the divide-and-conquer algorithm can be expressed as t̂_D = ∑_i γ_i (m_i/m′_i) ∑_j δ_ij. It can be verified that t̂_D is an unbiased estimator, and the variance of t̂_D is Var[t̂_D] = ∑_i γ_i (m_i²/m′_i)(t_i/m_i)(1 − t_i/m_i), where t_i is the number of records satisfying X_2 ⊆ Q_2 in S_i. Let S_i, i = 1, 2, …, l, be all the prefix collections with E_1 ⊆ Q_1 for a given query Q; then the variance can be written as Var[t̂_D] = ∑_{i=1}^{l} (m_i²/m′_i)(t_i/m_i)(1 − t_i/m_i).
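DC-Sampling itself can be sketched as follows, assuming each prefix collection is stored as its label E_1 over E_k together with the record suffixes X_2 (a simplified stand-in for the structure produced by Algorithm 2):

```python
def dc_sampling_estimate(q_top, q_rest, collections, b):
    """DC-Sampling sketch: prune by prefix labels, sample the survivors.

    collections: list of (label, suffixes) pairs, one per prefix collection,
    where label is E_1 ⊆ E_k and suffixes holds the X_2 parts as sets;
    q_top / q_rest are the query elements inside / outside E_k.
    """
    survivors = [(e1, recs) for e1, recs in collections
                 if e1 <= q_top and recs]            # Lemma 1: keep E_1 ⊆ Q_1
    m_live = sum(len(recs) for _, recs in survivors)
    estimate = 0.0
    for _, recs in survivors:
        n_i = max(1, round(b * len(recs) / m_live))  # budget over survivors only
        sample = random.sample(recs, min(n_i, len(recs)))
        hits = sum(x2 <= q_rest for x2 in sample)    # only X_2 ⊆ Q_2 is tested
        estimate += len(recs) * hits / len(sample)
    return estimate
```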

Compare with OT-Sampling
Obviously, the DC-Sampling method avoids allocating the space budget to unqualified partitions, in contrast to OT-Sampling. Formally, assume there are q partitions (corresponding to prefix collections) in total, {P_1, …, P_q}; after pruning, l partitions remain, w.l.o.g. {P_1, …, P_l}. Then, for DC-Sampling, the sampling probability is p_D = b/∑_{i=1}^{l} m_i, where m_i = |P_i| and b is the space budget, while the sampling probability of OT-Sampling is p_O = b/∑_{i=1}^{q} m_i = b/m. Since l ≤ q, we have p_D ≥ p_O; that is, DC-Sampling draws more samples from each qualified partition under the same budget, which leads to a smaller estimation variance.

Time Complexity
The time complexity of the DC-Sampling method is O(b · Ĉ) + O(P), where Ĉ is the cost of a two-record containment check and O(P) is the preprocessing time for partitioning the records by prefix. After pruning the unqualified partitions, our algorithm can skip comparing the prefix part of a record with the query; thus, the cost Ĉ is smaller than the corresponding cost in OT-Sampling, which leads to better efficiency than OT-Sampling.

Selectivity Estimation on Weighted Set Containment Search
In this section, we consider the set containment search on weighted records. We first present a simple random sampling method to address the problem, followed by the stratified sampling method to boost the estimation accuracy.

Random Sampling Approach
Similar to the selectivity estimation of the simple set containment search problem, we can apply the naive random sampling method to selectivity estimation on weighted datasets. Namely, given a (weighted) query record Q and a space budget b, we first uniformly at random sample b weighted records (X_1, X_2, …, X_b) from the dataset S; then we compare query Q with each sampled record X_i to verify whether X_i is weighted-included by Q (according to Definition 2) and count the number of sampled records satisfying X_i ⊆_w Q; finally, we scale up the count to get a selectivity estimation on weighted set containment search. Similar to Eq. 5, the estimator t̂_R^w is t̂_R^w = (m/b) ∑_{i=1}^{b} δ_i^w, where δ_i^w is the indicator function based on weighted records inclusion, i.e., δ_i^w = 1 if X_i ⊆_w Q and δ_i^w = 0 otherwise. Obviously, the estimator t̂_R^w is unbiased; the variance can be computed as Var[t̂_R^w] = t_w(m − t_w)/b, where t_w is the weighted containment selectivity.

Stratified Random Sampling
In order to boost the estimation accuracy, we can also utilize a partition-based sampling method (e.g., stratified sampling). Unfortunately, it is not applicable to group the records by building a trie structure on the weighted dataset, because the weights of each dimension can vary widely among records, and the number of trie-based partitions could be of the same order as the number of records in the dataset. In the following, we take into account the distribution of weights in each dimension and partition the dataset recursively on each dimension by dichotomizing the corresponding domain. Based on this partition, we present a divide-and-conquer algorithm to estimate the selectivity of weighted set containment search for a given query record.
Consider the domain D_j of the jth dimension corresponding to e_j, and let w_j = {w_1j, w_2j, …, w_mj} denote all the weights of the dataset in the jth dimension. Assume w_ij (i = 1, 2, …, m) follows a normal distribution, w_ij ∼ N(μ_j, σ_j²); then we can use the mean value μ_j as a boundary to partition the records, that is, records X_i with jth weight w_ij < μ_j are grouped together and the remaining records are collected in another group. Also, we choose the top-k high-frequency dimensions to partition the dataset. Remark that the weights of each record are sorted by decreasing order of weight frequency in the dataset, where the weight frequency of the jth dimension (e_j) is the number of nonzero weights in this dimension. Similar to the estimation in Sect. 4.1, Algorithm 3 illustrates the estimation strategy of stratified sampling.
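A sketch of the mean-dichotomy partitioning step behind Algorithm 3, assuming dense weight vectors and that the top-k high-frequency dimensions are given as index positions (hypothetical helper name):

```python
import statistics

def dichotomy_partition(dataset, dims):
    """Group weighted records by thresholding the chosen dimensions at
    their mean values, yielding at most 2^|dims| strata.

    dataset: list of dense weight vectors; dims: indices of the top-k
    high-frequency dimensions.
    """
    means = {j: statistics.mean(x[j] for x in dataset) for j in dims}
    groups = {}
    for x in dataset:
        label = tuple(x[j] >= means[j] for j in dims)  # one bit per dimension
        groups.setdefault(label, []).append(x)
    return groups
```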
Note that when the weights are binary values, the above partition strategy coincides with that of Algorithm 2 for the simple (unweighted) case. Based on the partition P_w = {P_1, …, P_|P_w|}, we can estimate the weighted set containment selectivity, similar to Eq. 11, as t̂_w = ∑_i (m_i/m′_i) ∑_j δ_ij^w, which is an unbiased estimator with variance Var[t̂_w] = ∑_i (m_i²/m′_i)(t_i/m_i)(1 − t_i/m_i), where t_i is the weighted containment selectivity of Q in P_i. Similarly, it can be proved that the variance of the stratified sampling approach is smaller than that of the naive random sampling method, i.e., Var[t̂_w] ≤ Var[t̂_R^w].

Query-Oriented Sampling
Furthermore, the estimation accuracy can be improved by utilizing the query record information. Given a weighted query record Q with top-k high-frequency weighted elements E_Q, we first compare E_Q with the label of each partition in P_w = {P_1, …, P_|P_w|} to prune the partitions whose records cannot be included by the query Q. With the surviving partitions P′ = {P′_1, …, P′_|P′|}, we can get an estimator as follows: t̂′ = ∑_{P′_i ∈ P′} (m_i/m′_i) ∑_j δ_ij^w. It can also be shown that the variance of t̂′ is smaller than that of the stratified sampling estimator t̂_w.

Experimental Evaluation
In this section, we evaluate the estimation accuracy and computation efficiency of different strategies on a variety of real-life datasets. All experiments are conducted on PCs with an Intel Xeon 2 × 2.3 GHz CPU and 128 GB RAM running Debian Linux. The following algorithms for containment selectivity estimation are evaluated.
• RS Simple random sampling method in Sect. 3.
• IL-GKMV Inverted list and G-KMV sketch-based method in Sect. 3.2.
• OT-Sampling Ordered trie structure-based sampling method in Sect. 4.1.
• DC-Sampling Divide-and-conquer-based sampling method in Sect. 4.
We also evaluate the methods for weighted set containment selectivity estimation as follows.
• WRS Direct random sampling method in Sect. 5.
• StrRS Stratified random sampling method in Sect. 5.
• StrQRS Query-oriented stratified random sampling method in Sect. 5.
The above algorithms are implemented in C++. In verifying the inclusion relationship between the query and records, we apply the merge join method. For records with large size, we utilize the prefix tree structure to boost the computation efficiency.
Datasets We deploy nine real-life datasets chosen from various domains with different data properties. Table 3 illustrates the characteristics of these nine datasets in detail; note that the last three are weighted datasets. For each dataset, we show the representations of record and element, the number of records, the average record length and the number of distinct elements.
Workload The workload for the selectivity estimation of set containment search consists of 10,000 queries, each of which is selected uniformly at random from the dataset. Note that we exclude queries with size smaller than 10 in order to evaluate the accuracy properly.
Measurement In the following, we use the relative error to measure accuracy. Let t be the exact result and t̂ the estimated one; then the relative error, denoted by ε, is calculated as ε = |t̂ − t|/t. The sampling size is in terms of the number of records. For the IL-GKMV approach, the space budget is allocated as discussed at the end of Sect. 3.
Tuning k In order to evaluate the impact of the high-frequency elements in OT-Sampling and DC-Sampling, we first tune the number of the highest-frequency elements, i.e., top-k. By experimental study, we set k to 12, which balances the trade-off between accuracy and efficiency well.
Overall performance Figure 2a reports the relative error of the four algorithms on six (unweighted) datasets. The sample size is set as 1000 in terms of the number of records; for the trie structure-based approach and the divide-and-conquer algorithm, the k value is 12 as mentioned above. Overall, we can see that the divide-and-conquer algorithm (DC-Sampling) achieves the best accuracy on all datasets: it reduces the relative error of the random sampling (RS) method by around 60% and cuts the relative error of the IL-GKMV method by more than 80%. Also, the ordered trie structure-based approach (OT-Sampling) diminishes the relative error of RS by around 40% for most datasets and narrows the relative error of IL-GKMV by about 70%. Moreover, DC-Sampling outperforms OT-Sampling by decreasing the relative error by about half. Figure 2b reports the query response time on the six datasets with 10,000 queries, where the DC-Sampling method consumes less time than the other three because of its pruning techniques. It is remarkable that, for each dataset, the time costs of the four algorithms are comparable since we keep the same sample size in every algorithm. Meanwhile, the response time varies among datasets because of the diverse average record lengths: datasets with larger average length, e.g., NETFLIX with an average length of 209.25, consume more query time.

Estimation Accuracy Evaluation
In this section, we assess the effectiveness of the four methods in terms of relative error. We consider the effect of the space budget on the estimation accuracy by varying the sampling size. Figure 3 illustrates the superior accuracy of DC-Sampling against the other three methods across space budgets. As anticipated, the accuracy of all algorithms improves as the sampling size increases.

Computation Efficiency Evaluation
In this section, we evaluate the efficiency of the four algorithms in terms of query response time with 10,000 queries. Figure 4 demonstrates the response time of four algorithms with different space budgets. Obviously, the query response time increases as the sampling size grows. The DC-Sampling method outperforms the other three algorithms because of the pruning techniques.

Weighted Set Containment Search
In the last part of the experiments, we assess the estimation accuracy and efficiency of selectivity estimation for weighted set containment search. As for the experimental setting, we choose k = 12 for the top-k high-frequency weighted elements. Figure 5 illustrates the overall accuracy and efficiency of the WRS, StrRS and StrQRS methods. We can see that the StrQRS method outperforms the other two algorithms under the same space budget, and the StrRS method achieves better accuracy than WRS. The time cost of the three algorithms (with 10,000 query records) is similar since the sample budget (1000 records) is the same. Figure 6 compares the accuracy of the three algorithms under varying space budgets. Obviously, as the space budget grows, the relative error gets smaller, i.e., the accuracy of the three methods keeps increasing. We can also find that the StrQRS method always beats the others across different sampling sizes.

Related Work
To the best of our knowledge, there is no existing work on selectivity estimation of set containment search. In this section, we review two important directions closely related to the problem studied in this paper.

Searching Set-Valued Data
The study of set-valued data has attracted great attention from research communities and industry due to the ever-increasing prevalence of set-valued data in a wide range of applications. The research in this area focuses on set containment search [18,19,28] and on set similarity and set containment joins [20,22,23,30]. In one of the representative works on set containment search [28], Terrovitis et al. introduce the OIF index, which combines an inverted index with a B-tree, to tackle three kinds of set containment queries: subset queries, equality queries and superset queries. In a recent work [34], Yang et al. propose the TT-join method for the set containment join problem, which is based on a prefix tree structure and utilizes element frequency information; they also present a detailed summary of existing set containment join methods. Containment queries can also be modeled as a range searching problem in computational geometry [8]; nevertheless, the performance is exponentially dependent on the dimension n, which is unsuitable in practice for our problem.

Selectivity Estimation
The problem of selectivity estimation has been studied for a large variety of queries and over a diverse range of data types, such as range queries (e.g., [16]), Boolean queries (e.g., [14]), relational joins (e.g., [29]), spatial joins (e.g., [17]) and set intersections (e.g., [16]). Nevertheless, many of the techniques developed in these works are sensitive to the dimensionality of the data and not applicable to the problem studied in this paper. Moreover, the superset containment semantics brings extra challenges in adopting existing techniques. Although the set containment search query can be naturally modeled as a range counting problem as discussed in Sect. 1, existing range counting techniques are exponentially dependent on the dimensionality (i.e., the number of distinct elements in our problem) and cannot solve the containment selectivity estimation problem in our setting [16,27]. Distinct value estimators (e.g., KMV [12], bottom-k, min-hash [16]) are adopted in [32] to solve subset containment search (i.e., the query record is a subset of the data record). We also extend the distinct value estimator KMV and develop the IL-GKMV approach in Sect. 3, and we demonstrate theoretically and through extensive experiments that distinct value estimators cannot efficiently and accurately support the superset containment semantics studied in this paper.

Conclusion
The prevalence of set-valued data generates a wide variety of applications that call for sophisticated processing techniques.
In this paper, we investigate the problem of selectivity estimation on set containment search and develop novel and efficient sampling-based techniques, OT-Sampling and DC-Sampling, to address the inherent challenges of set containment search and the limitations of existing techniques. Simple random sampling techniques and a G-KMV sketch-based estimating approach, IL-GKMV, are also devised as baseline solutions. Meanwhile, we consider the selectivity estimation of weighted set containment search and propose a stratified sampling method to tackle this problem. We theoretically analyze the accuracy of the proposed techniques by means of expectation and variance. Our comprehensive experiments on nine real-life datasets empirically verify the effectiveness and efficiency of the sampling-based approaches.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.