Set containment join revisited

Bouros, Panagiotis; Mamoulis, Nikos; Ge, Shen; Terrovitis, Manolis

doi:10.1007/s10115-015-0895-7

Regular Paper
Published: 26 October 2015

Volume 49, pages 375–402, (2016)
Knowledge and Information Systems

Abstract

Given two collections of set objects R and S, the \(R\bowtie _{\subseteq }S\) set containment join returns all object pairs \((r,s) \in R\times S\) such that \(r\subseteq s\). Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (\(\mathtt {PRETTI}\)) builds an inverted index on the right-hand collection S and a prefix tree on the left-hand collection R that groups set objects with common prefixes and thus avoids redundant processing. In this paper, we present a framework that improves \(\mathtt {PRETTI}\) in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms \(\mathtt {PRETTI}\) by a wide margin.
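To make the operator concrete, the following is a minimal Python sketch of a set containment join in the spirit of the prefix-tree and inverted-index approach described above. It is an illustration under stated assumptions, not the authors' implementation: the function names (`naive_join`, `build_inverted_index`, `build_prefix_tree`, `prefix_tree_join`) are hypothetical, objects are identified by their position in the input list, items are assumed to be mutually comparable, and neither the cost-model-based limit on the prefix tree nor the first-item partitioning is modeled.

```python
from itertools import product


def naive_join(R, S):
    """Baseline: test r ⊆ s for every pair in R × S."""
    return {
        (i, j)
        for (i, r), (j, s) in product(enumerate(R), enumerate(S))
        if set(r) <= set(s)
    }


def build_inverted_index(S):
    """Map each item to the ids of the objects in S that contain it."""
    index = {}
    for j, s in enumerate(S):
        for item in s:
            index.setdefault(item, set()).add(j)
    return index


def build_prefix_tree(R):
    """Trie over the internally sorted objects of R.

    Objects sharing a prefix share a path; each node stores the ids of
    the objects of R that end exactly at that node.
    """
    root = {"children": {}, "ids": []}
    for i, r in enumerate(R):
        node = root
        for item in sorted(set(r)):
            node = node["children"].setdefault(item, {"children": {}, "ids": []})
        node["ids"].append(i)
    return root


def prefix_tree_join(R, S):
    """Illustrative join: one list intersection per prefix-tree node."""
    index = build_inverted_index(S)
    root = build_prefix_tree(R)
    result = set()
    # Each stack entry pairs a node with the ids of the objects in S that
    # contain every item on the path from the root to that node.
    stack = [(root, set(range(len(S))))]
    while stack:
        node, candidates = stack.pop()
        result.update((i, j) for i in node["ids"] for j in candidates)
        for item, child in node["children"].items():
            narrowed = candidates & index.get(item, set())
            if narrowed:  # an empty candidate list prunes the whole subtree
                stack.append((child, narrowed))
    return result


R = [{1, 2}, {1, 2, 3}, {2}]
S = [{1, 2, 3}, {2, 4}, {1, 2}]
pairs = prefix_tree_join(R, S)
```

In this toy input, the objects {1, 2} and {1, 2, 3} share the path 1, 2 in the prefix tree, so the intersection of the inverted lists for items 1 and 2 is computed once and reused for both objects.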

Notes

Our experiments show that an increasing frequency order is in practice more beneficial. Yet, for the sake of readability, we present both \(\mathtt {PRETTI}\) and our methodology considering a decreasing order.
This is different from the external-memory partitioning of the \(\mathtt {PRETTI}\) paradigm, discussed at the end of Sect. 2.
Note that \(\mathtt {OPJ}\) and \(\mathtt {PRETTI}\) perform the same number of list intersections; i.e., \(\mathtt {OPJ}\) does not save list intersections, but makes them cheaper.
These methods are infeasible in practice: they rely on a priori knowledge that is not available at runtime and is extremely expensive to compute before the join.

Acknowledgments

This work was supported by the HKU 714212E Grant from Hong Kong RGC and the MEDA project within GSRT's KRIPIS action, funded by Greece and the European Regional Development Fund of the European Union under the O.P. Competitiveness and Entrepreneurship, NSRF 2007–2013 and the Regional Operational Program of ATTIKI.

Author information

Authors and Affiliations

Department of Computer Science, Aarhus University, Aabogade 34, 8200, Aarhus N, Denmark
Panagiotis Bouros
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, SAR, China
Nikos Mamoulis & Shen Ge
Institute for the Management of Information Systems, Research Center “Athena”, Artemidos 6 & Epidavrou, 15125, Marousi, Greece
Manolis Terrovitis

Corresponding author

Correspondence to Panagiotis Bouros.

Additional information

This work was conducted while P. Bouros was with The University of Hong Kong.

About this article

Cite this article

Bouros, P., Mamoulis, N., Ge, S. et al. Set containment join revisited. Knowl Inf Syst 49, 375–402 (2016). https://doi.org/10.1007/s10115-015-0895-7

Received: 28 January 2015
Revised: 28 July 2015
Accepted: 03 October 2015
Published: 26 October 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s10115-015-0895-7
