Set containment join revisited

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Given two collections of set objects R and S, the \(R\bowtie _{\subseteq }S\) set containment join returns all object pairs \((r,s) \in R\times S\) such that \(r\subseteq s\). Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (\(\mathtt {PRETTI}\)) builds an inverted index on the right-hand collection S and a prefix tree on the left-hand collection R; the prefix tree groups set objects with common prefixes and thus avoids redundant processing. In this paper, we present a framework that improves \(\mathtt {PRETTI}\) in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms \(\mathtt {PRETTI}\) by a wide margin.
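To make the baseline concrete, the following is a minimal Python sketch of the idea the abstract attributes to \(\mathtt {PRETTI}\): an inverted index on S, a prefix tree on R, and one candidate-list intersection per tree node that is shared by all objects of R below that node. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based tree nodes, and the use of the items' natural sort order (rather than a frequency-based order) are all hypothetical choices, and the sketch includes neither the adaptive cost model nor the partitioning proposed in the paper.

```python
from collections import defaultdict


def build_inverted_index(S):
    """Map each item to the set of ids of the objects in S that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(S):
        for item in s:
            index[item].add(sid)
    return index


def build_prefix_tree(R):
    """Insert every object of R into a trie, items taken in sorted order.

    A node is a dict with 'children' (item -> node) and 'ids', the ids of
    the objects of R that end exactly at that node. Sorting by item value
    is an assumption made for this sketch.
    """
    root = {"children": {}, "ids": []}
    for rid, r in enumerate(R):
        node = root
        for item in sorted(r):
            node = node["children"].setdefault(
                item, {"children": {}, "ids": []}
            )
        node["ids"].append(rid)
    return root


def containment_join(R, S):
    """Return all pairs (rid, sid) such that R[rid] is a subset of S[sid]."""
    index = build_inverted_index(S)
    root = build_prefix_tree(R)
    results = []
    # Each stack entry pairs a tree node with the ids of the objects in S
    # that contain every item on the path from the root to that node.
    stack = [(root, set(range(len(S))))]
    while stack:
        node, candidates = stack.pop()
        for rid in node["ids"]:
            results.extend((rid, sid) for sid in candidates)
        for item, child in node["children"].items():
            # One intersection per node, reused by all objects sharing the prefix.
            narrowed = candidates & index.get(item, set())
            if narrowed:
                stack.append((child, narrowed))
    return sorted(results)


if __name__ == "__main__":
    R = [{1, 2}, {1, 2, 3}, {4}]
    S = [{1, 2, 3}, {1, 2}, {2, 4, 5}]
    print(containment_join(R, S))  # [(0, 0), (0, 1), (1, 0), (2, 2)]
```

In the example, the objects {1, 2} and {1, 2, 3} of R share the prefix 1, 2, so the candidate list for that prefix is computed once and then narrowed further only for the longer object.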

Notes

  1. Our experiments show that an increasing frequency order is in practice more beneficial. Yet, for the sake of readability, we present both \(\mathtt {PRETTI}\) and our methodology considering a decreasing order.

  2. This is different from the external-memory partitioning of the \(\mathtt {PRETTI}\) paradigm, discussed at the end of Sect. 2.

  3. Note that \(\mathtt {OPJ}\) and \(\mathtt {PRETTI}\) perform the same number of list intersections; i.e., \(\mathtt {OPJ}\) does not save list intersections, but makes them cheaper.

  4. These methods are infeasible in practice: they rely on a priori knowledge that is not available at runtime and is extremely expensive to compute before the join.

Acknowledgments

This work was supported by the HKU 714212E Grant from Hong Kong RGC and the MEDA project within GSRT's KRIPIS action, funded by Greece and the European Regional Development Fund of the European Union under the O.P. Competitiveness and Entrepreneurship, NSRF 2007–2013 and the Regional Operational Program of ATTIKI.

Author information

Corresponding author

Correspondence to Panagiotis Bouros.

Additional information

This work was conducted while P. Bouros was with The University of Hong Kong.

Cite this article

Bouros, P., Mamoulis, N., Ge, S. et al. Set containment join revisited. Knowl Inf Syst 49, 375–402 (2016). https://doi.org/10.1007/s10115-015-0895-7

  • DOI: https://doi.org/10.1007/s10115-015-0895-7
