
Knowledge and Information Systems, Volume 49, Issue 1, pp 375–402

Set containment join revisited

  • Panagiotis Bouros
  • Nikos Mamoulis
  • Shen Ge
  • Manolis Terrovitis
Regular Paper

Abstract

Given two collections of set objects R and S, the \(R\bowtie _{\subseteq }S\) set containment join returns all object pairs \((r,s) \in R\times S\) such that \(r\subseteq s\). Besides being a basic operator in all modern data management systems, with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (\(\mathtt {PRETTI}\)) builds an inverted index on the right-hand collection S and a prefix tree on the left-hand collection R; the prefix tree groups set objects with common prefixes and thus avoids redundant processing. In this paper, we present a framework that improves \(\mathtt {PRETTI}\) in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms \(\mathtt {PRETTI}\) by a wide margin.
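
To make the baseline idea concrete, the following is a minimal Python sketch of a containment join that combines an inverted index on S with a prefix tree on R, in the spirit of the approach described above. It is an illustration under assumptions, not the authors' implementation: the names `build_inverted_index`, `TrieNode`, `build_prefix_tree` and `containment_join` are hypothetical, items are assumed to be sortable values such as integers, and the cost-model-based limiting of the prefix tree and the partitioning by first item are both omitted.

```python
from collections import defaultdict


def build_inverted_index(S):
    """Map each item to the set of ids of objects in S that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(S):
        for item in s:
            index[item].add(sid)
    return index


class TrieNode:
    def __init__(self):
        self.children = {}    # item -> TrieNode
        self.object_ids = []  # ids of objects in R that end at this node


def build_prefix_tree(R):
    """Insert every object of R as a sorted item sequence."""
    root = TrieNode()
    for rid, r in enumerate(R):
        node = root
        for item in sorted(r):
            node = node.children.setdefault(item, TrieNode())
        node.object_ids.append(rid)
    return root


def containment_join(R, S):
    """Return all (rid, sid) pairs with R[rid] a subset of S[sid]."""
    index = build_inverted_index(S)
    root = build_prefix_tree(R)
    results = []
    # Each stack entry pairs a trie node with the ids of the objects in S
    # that contain every item on the path from the root to that node.
    stack = [(root, set(range(len(S))))]
    while stack:
        node, candidates = stack.pop()
        for rid in node.object_ids:
            results.extend((rid, sid) for sid in candidates)
        for item, child in node.children.items():
            # Objects in R sharing this prefix reuse one intersection.
            narrowed = candidates & index.get(item, set())
            if narrowed:
                stack.append((child, narrowed))
    return sorted(results)


R = [{1, 2}, {1, 2, 3}, {2}]
S = [{1, 2, 3}, {2, 4}, {1, 2}]
print(containment_join(R, S))
```

In this sketch, the objects {1, 2} and {1, 2, 3} of R share the prefix 1, 2, so the candidate list for that prefix is computed once and reused, which is the kind of redundancy the prefix tree is meant to avoid.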

Keywords

Set-valued data, Containment join, Query processing, Inverted index, Prefix tree

Notes

Acknowledgments

This work was supported by the HKU 714212E Grant from Hong Kong RGC and the MEDA project within GSRT's KRIPIS action, funded by Greece and the European Regional Development Fund of the European Union under the O.P. Competitiveness and Entrepreneurship, NSRF 2007–2013 and the Regional Operational Program of ATTIKI.


Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  • Panagiotis Bouros (1)
  • Nikos Mamoulis (2)
  • Shen Ge (2)
  • Manolis Terrovitis (3)
  1. Department of Computer Science, Aarhus University, Aarhus N, Denmark
  2. Department of Computer Science, The University of Hong Kong, Hong Kong, China
  3. Institute for the Management of Information Systems, Research Center “Athena”, Marousi, Greece
