Abstract
Given two collections of set objects R and S, the \(R\bowtie _{\subseteq }S\) set containment join returns all object pairs \((r,s) \in R\times S\) such that \(r\subseteq s\). Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (\(\mathtt {PRETTI}\)) builds an inverted index on the right-hand collection S and a prefix tree on the left-hand collection R that groups set objects with common prefixes and thus, avoids redundant processing. In this paper, we present a framework which improves \(\mathtt {PRETTI}\) in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms \(\mathtt {PRETTI}\) by a wide margin.
Similar content being viewed by others
Notes
Our experiments show that an increasing frequency order is in practice more beneficial. Yet, for the sake of readability, we present both \(\mathtt {PRETTI}\) and our methodology considering a decreasing order.
This is different than the external-memory partitioning of the \(\mathtt {PRETTI}\) paradigm, discussed at the end of Sect. 2.
Note that \(\mathtt {OPJ}\) and \(\mathtt {PRETTI}\) perform the same number of list intersections; i.e., \(\mathtt {OPJ}\) does not save list intersections, but makes them cheaper.
These are infeasible methods using apriori knowledge which is not known at runtime and it is extremely expensive to compute before the join.
References
Agrawal P, Arasu A, Kaushik R (2010) On indexing error-tolerant set containment. In: SIGMOD conference, pp 927–938
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB, pp 487–499
Arasu A, Ganti V, Kaushik R (2006) Efficient exact set-similarity joins. In: VLDB, pp 918–929
Baeza-Yates RA (2004) A fast set intersection algorithm for sorted sequences. In: CPM, pp 400–408
Baeza-Yates RA, Salinger A (2005) Experimental analysis of a fast intersection algorithm for sorted sequences. In: SPIRE, pp 13–24
Baeza-Yates RA, Salinger A (2010) Fast intersection algorithms for sorted sequences. In: Algorithms and applications. Springer, Berlin Heidelberg, pp 45–61
Barbay J, Kenyon C (2002) Adaptive intersection and t-threshold problems. In: SODA, pp 390–399
Barbay J, López-Ortiz A, Lu T, Salinger A (2009) An experimental investigation of set intersection algorithms for text searching. ACM J Exp Algorithmics 14:7:3.7–7:3.24
Bayardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: WWW
Bouros P, Ge S, Mamoulis N (2012) Spatio-textual similarity joins. PVLDB 6(1):1–12
Broder A (1997) On the resemblance and containment of documents. In: SEQUENCES, pp 21–29
Broder AZ (2000) Identifying and filtering near-duplicate documents. In: CPM, pp 1–10
Cao B, Badia A (2005) A nested relational approach to processing SQL subqueries. In: SIGMOD conference, pp 191–202
Chaudhuri S, Church KW, König AC, Sui L (2007) Heavy-tailed distributions and multi-keyword queries. In: SIGIR, pp 663–670
Chaudhuri S, Ganti V, Kaushik R (2006) A primitive operator for similarity joins in data cleaning. In ICDE, p 5
Chen Z, Korn F, Koudas N, Muthukrishnan S (2000) Selectivity estimation for boolean queries. In: PODS, pp 216–225
Chen Z, Korn F, Koudas N, Muthukrishnan S (2003) Generalized substring selectivity estimation. J Comput Syst Sci 66(1):98–132
Culpepper JS, Moffat A (2010) Efficient set intersection for inverted indexing. ACM Trans Inf Syst 29(1):1
Demaine ED, López-Ortiz A, Munro JI (2000) Adaptive set intersections, unions, and differences. In: SODA, pp 743–752
Demaine ED, López-Ortiz A, Munro JI (2001) Experiments on adaptive set intersections for text retrieval systems. In: ALENEX, pp 91–104
Helmer S, Moerkotte G (1997) Evaluation of main memory join algorithms for joins with set comparison join predicates. In: VLDB, pp 386–395
Helmer S, Moerkotte G (2003) A performance study of four index structures for set-valued attributes of low cardinality. VLDBJ 12(3):244–261
Ibrahim A, Fletcher GHL (2013) Efficient processing of containment queries on nested sets. In: EDBT, pp 227–238
Jampani R, Pudi V (2005) Using prefix-trees for efficiently computing set joins. In: DASFAA, pp 761–772
Jiang Y, Li G, Feng J, Li W (2014) String similarity joins: an experimental evaluation. PVLDB 7(8):625–636
Köhler H (2010) Estimating set intersection using small samples. In: ACSC, pp 71–78
Mamoulis N (2003) Efficient processing of joins on set-valued attributes. In: SIGMOD conference, pp 157–168
Melnik S, Garcia-Molina H (2002) Divide-and-conquer algorithm for computing set containment joins. In: EDBT, pp 427–444
Melnik S, Garcia-Molina H (2003) Adaptive algorithms for set containment joins. ACM Trans Database Syst 28:56–99
Ramasamy K, Patel JM, Naughton JF, Kaushik R (2000) Set containment joins: The good, the bad and the ugly. In: VLDB, pp 351–362
Rantzau R (2003) Processing frequent itemset discovery queries by division and set containment join operators. In: DMKD, pp 20–27
Rantzau R, Shapiro LD, Mitschang B, Wang Q (2003) Algorithms and applications for universal quantification in relational databases. Inf Syst 28(1–2):3–32
Ribeiro L, Härder T (2009) Efficient set similarity joins using min-prefixes. In: Advances in databases and information systems, 13th East European conference, ADBIS 2009, Riga, Latvia, September 7–10, 2009. Proceedings, pp 88–102
Sarawagi S, Kirpal A (2004) Efficient set joins on similarity predicates. In: SIGMOD conference, pp 743–754
Tatikonda S, Cambazoglu BB, Junqueira FP (2011) Posting list intersection on multicore architectures. In: SIGIR, pp 963–972
Tatikonda S, Junqueira F, Cambazoglu BB, Plachouras V (2009) On efficient posting list intersection with multicore processors. In: SIGIR, pp 738–739
Terrovitis M, Bouros P, Vassiliadis P, Sellis TK, Mamoulis N (2011) Efficient answering of set containment queries for skewed item distributions. In: EDBT, pp 225–236
Terrovitis M, Passas S, Vassiliadis P, Sellis TK (2006) A combination of trie-trees and inverted files for the indexing of set-valued attributes. In: CIKM, pp 728–737
Tsirogiannis D, Guha S, Koudas N (2009) Improving the performance of list intersection. PVLDB 2(1):838–849
Wang J, Li G, Feng J (2012) Can we beat the prefix filtering? An adaptive framework for similarity join and search. In: SIGMOD conference, pp 85–96
Xiao C, Wang W, Lin X, Yu JX (2008) Efficient similarity joins for near duplicate detection. In: WWW, pp 131–140
Zhang X, Chen K, Shou L, Chen G, Gao Y, Tan K-L (2012) Efficient processing of probabilistic set-containment queries on uncertain set-valued data. Inf Sci 196:97–117
Zheng Z, Kohavi R, Mason L (2001) Real world performance of association rule algorithms. In: KDD, pp 401–406
Zobel J, Moffat A, Ramamohanarao K (1998) Inverted files versus signature files for text indexing. TOIS 23(4):453–490
Acknowledgments
This work was supported by the HKU 714212E Grant from Hong Kong RGC and the MEDA project within GSRTs KRIPIS action, funded by Greece and the European Regional Development Fund of the European Union under the O.P. Competitiveness and Entrepreneurship, NSRF 2007–2013 and the Regional Operational Program of ATTIKI.
Author information
Authors and Affiliations
Corresponding author
Additional information
This work was conducted while P. Bouros was with The University of Hong Kong.
Rights and permissions
About this article
Cite this article
Bouros, P., Mamoulis, N., Ge, S. et al. Set containment join revisited. Knowl Inf Syst 49, 375–402 (2016). https://doi.org/10.1007/s10115-015-0895-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-015-0895-7