Skip to main content
Log in

Internal and external memory set containment join

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

A Correction to this article was published on 02 April 2021

This article has been updated

Abstract

A set containment join operates on two set-valued attributes with a subset (\(\subseteq \)) relationship as the join condition. It has many real-world applications, such as in publish/subscribe services and inclusion dependency discovery. Existing solutions can be broadly classified into union-oriented and intersection-oriented methods. Based on several recent studies, union-oriented methods are not competitive as they involve an expensive subset enumeration step. Intersection-oriented methods build an inverted index on one attribute and perform inverted list intersection on another attribute. Existing intersection-oriented methods intersect inverted lists one-by-one. In contrast, in this paper, we propose to intersect all the inverted lists simultaneously while skipping many irrelevant entries in the lists. To share computation, we utilize the prefix tree structure and extend our novel list intersection method to operate on the prefix tree. To further improve the efficiency, we propose to partition the data and process each partition separately. Each partition will be associated with a much smaller inverted index, and the set containment join cost can be significantly reduced. Moreover, to support large-scale datasets that are beyond the available memory space, we develop a novel adaptive data partition method that is designed to fully leverage the available memory and achieve high I/O efficiency, and thereby exhibiting outstanding performance for external memory set containment join. We evaluate our methods using both real-world and synthetic datasets. Experimental results demonstrate that our method outperforms state-of-the-art methods by up to 10\(\times \) when the dataset is completely resided in memory. Furthermore, our approach achieves up to two orders of magnitude improvement on I/O efficiency compared with a baseline method when the dataset size exceeds the main memory space.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16

Similar content being viewed by others

Change history

Notes

  1. In our implementation, we use the element frequency order as the global order and use the most frequent element to partition the data.

  2. In the experiment, we empirically set it to 3.

  3. https://www.flickr.com.

  4. http://www.cim.mcgill.ca/~dudek/206/Logs/.

  5. https://snap.stanford.edu/data/com-Orkut.html.

  6. https://snap.stanford.edu/data/twitter-2010.html.

  7. https://www.aminer.cn/citation.

  8. https://www.aminer.cn/data-sna.

  9. http://jmcauley.ucsd.edu/data/amazon/links.html.

  10. http://konect.uni-koblenz.de/networks/delicious-ui.

References

  1. Agrawal, M., Manchanda, K., Soni, R., Lal, A., Chowdary, C.R.: Parallel implementation of local similarity search for unstructured text using prefix filtering. In: International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 98–103 (2017)

  2. Agrawal, P., Arasu, A., Kaushik, R.: On indexing error-tolerant set containment. In: SIGMOD, pp. 927–938 (2010)

  3. Baraglia, R., Morales, G.D.F., Lucchese, C.: Document similarity self-join with mapreduce. In: ICDM, pp. 731–736 (2010)

  4. Bayardo, R.J., Ma, Y., Srikant, R.P: Scaling up all pairs similarity search. In: WWW, pp. 131–140 (2007)

  5. Bouros, P., Mamoulis, N., Ge, S., Terrovitis, M.: Set containment join revisited. Knowl. Inf. Syst. 49(1), 375–402 (2016)

    Article  Google Scholar 

  6. Dean, J., Ghemawat, S.: Mapreduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)

    Article  Google Scholar 

  7. Deng, D., Kim, A., Madden, S., Stonebraker, M.: Silkmoth: an efficient method for finding related sets with maximum matching constraints. PVLDB 10(10), 1082–1093 (2017)

    Google Scholar 

  8. Deng, D., Li, G., Hao, S., Wang, J., Feng, J.: Massjoin: a mapreduce-based method for scalable string similarity joins. In: ICDE, pp. 340–351 (2014)

  9. Deng, D., Li, G., Wen, H., Feng, J.: An efficient partition based method for exact set similarity joins. PVLDB 9(4), 360–371 (2015)

    Google Scholar 

  10. Deng, D., Tao, Y., Li, G.: Overlap set similarity joins with theoretical guarantees. In: SIGMOD, pp. 905–920 (2018)

  11. Ding, X., Yang, W., Choo, K.R., Wang, X., Jin, H.: Privacy preserving similarity joins using mapreduce. Inf. Sci. 493, 20–33 (2019)

    Article  Google Scholar 

  12. do Carmo Oliveira, D.J., Borges, F.F., Ribeiro, L.A., Cuzzocrea, A.: Set similarity joins with complex expressions on distributed platforms. In: ADBIS, pp. 216–230 (2018)

  13. Elsayed, T., Lin, J.J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: ACL, pp. 265–268 (2008)

  14. Fier, F., Augsten, N., Bouros, P., Leser, U., Freytag, J.: Set similarity joins on mapreduce: an experimental survey. PVLDB 11(10), 1110–1122 (2018)

    Google Scholar 

  15. Gavagsaz, E., Rezaee, A., Javadi, H.H.S.: Load balancing in join algorithms for skewed data in mapreduce systems. J. Supercomput. 75(1), 228–254 (2019)

    Article  Google Scholar 

  16. Helmer, S., Moerkotte, G.: Evaluation of main memory join algorithms for joins with set comparison join predicates. In: VLDB, pp. 386–395 (1997)

  17. Helmer, S., Moerkotte, G.: A performance study of four index structures for set-valued attributes of low cardinality. VLDB J. 12(3), 244–261 (2003)

    Article  Google Scholar 

  18. Ibrahim, A., Fletcher, G.H.L.: Efficient processing of containment queries on nested sets. In: EDBT, pp. 227–238 (2013)

  19. Jampani, R., Pudi, V.: Using prefix-trees for efficiently computing set joins. In: DASFAA, pp. 761–772 (2005)

  20. Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. PVLDB 7(8), 625–636 (2014)

    Google Scholar 

  21. Kunkel, A., Rheinländer, A., Schiefer, C., Helmer, S., Bouros, P., Leser, U.: Piejoin: towards parallel set containment joins. In: SSDBM, pp. 11:1–11:12 (2016)

  22. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)

  23. Li, G., Deng, D., Feng, J.P.: A partition-based method for string similarity joins with edit-distance constraints. ACM Trans. Database Syst. 38(2), 9:1–9:33 (2013)

    Article  MathSciNet  Google Scholar 

  24. Li, G., Deng, D., Wang, J., Feng, J.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)

    Google Scholar 

  25. Li, R., Ju, L., Peng, Z., Yu, Z., Wang, C.: Batch text similarity search with mapreduce. In: 13th Asia-Pacific Web Conference, pp. 412–423 (2011)

  26. Liu, W., Shen, Y., Wang, P.: An efficient mapreduce algorithm for similarity join in metric spaces. J. Supercomput. 72(3), 1179–1200 (2016)

    Article  Google Scholar 

  27. Luo, Y., Fletcher, G.H.L., Hidders, J., Bra, P.D.: Efficient and scalable trie-based algorithms for computing set containment relations. In: ICDE, pp. 303–314 (2015)

  28. Mamoulis, N.: Efficient processing of joins on set-valued attributes. In SIGMOD, pp. 157–168 (2003)

  29. Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. PVLDB 9(9), 636–647 (2016)

    Google Scholar 

  30. Melnik, S., Garcia-Molina, H.: Divide-and-conquer algorithm for computing set containment joins. In: EDBT, pp. 427–444 (2002)

  31. Melnik, S., Garcia-Molina, H.: Adaptive algorithms for set containment joins. ACM Trans. Database Syst. 28, 56–99 (2003)

    Article  Google Scholar 

  32. Metwally, A., Faloutsos, C.: V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB 5(8), 704–715 (2012)

    Google Scholar 

  33. Newman, M.E.J.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 323–351 (2005)

    Article  Google Scholar 

  34. Qin, J., Xiao, C.: Pigeonring: a principle for faster thresholded similarity search. PVLDB 12(1), 28–42 (2018)

    Google Scholar 

  35. Ramasamy, K., Patel, J.M., Naughton, J.F., Kaushik, R.P: Set containment joins: the good, the bad and the ugly. In: VLDB, pp. 351–362 (2000)

  36. Roberts, C.: Partial-match retrieval via the method of superimposed codes. Proc. IEEE 67(12), 1624–1642 (1979)

    Article  Google Scholar 

  37. Rong, C., Lin, C., Silva, Y.N., Wang, J., Lu, W., Du, X.: Fast and scalable distributed set similarity joins for big data analytics. In: ICDE, pp. 1059–1070 (2017)

  38. Rong, C., Lu, W., Wang, X., Du, X., Chen, Y., Tung, A.K.H.: Efficient and scalable processing of string similarity join. IEEE Trans. Knowl. Data Eng. 25(10), 2217–2230 (2013)

    Article  Google Scholar 

  39. Sarma, A.D., He, Y., Chaudhuri, S.: Clusterjoin: a similarity joins framework using map-reduce. PVLDB 7(12), 1059–1070 (2014)

    Google Scholar 

  40. Silva, Y.N., Reed, J.M.: Exploiting mapreduce-based similarity joins. In: SIGMOD, pp. 693–696 (2012)

  41. Sun, J., Shang, Z., Li, G., Bao, Z., Deng, D.: Balance-aware distributed string similarity-based query processing system. PVLDB 12(9), 961–974 (2019)

    Google Scholar 

  42. Sun, J., Shang, Z., Li, G., Deng, D., Bao, Z.: Dima: a distributed in-memory similarity-based query processing system. PVLDB 10(12), 1925–1928 (2017)

    Google Scholar 

  43. Terrovitis, M., Bouros, P., Vassiliadis, P., Sellis, T.K., Mamoulis, N.: Efficient answering of set containment queries for skewed item distributions. In: EDBT, pp. 225–236 (2011)

  44. Terrovitis, M., Liagouris, J., Mamoulis, N., Skiadopoulos, S.: Privacy preservation by disassociation. PVLDB 5(10), 944–955 (2012)

    Google Scholar 

  45. Terrovitis, M., Mamoulis, N., Kalnis, P.: Privacy-preserving anonymization of set-valued data. PVLDB 1(1), 115–125 (2008)

    Google Scholar 

  46. Terrovitis, M., Mamoulis, N., Kalnis, P.: Local and global recoding methods for anonymizing set-valued data. VLDB J. 20(1), 83–106 (2011)

    Article  Google Scholar 

  47. Terrovitis, M., Passas, S., Vassiliadis, P., Sellis, T.K.: A combination of trie-trees and inverted files for the indexing of set-valued attributes. In: CIKM, pp. 728–737 (2006)

  48. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: SIGMOD, pp. 495–506 (2010)

  49. Wandelt, S., Deng, D., Gerdjikov, S., Mishra, S., Mitankin, P., Patil, M., Siragusa, E., Tiskin, A., Wang, W., Wang, J., Leser, U.: State-of-the-art in string similarity search and join. SIGMOD Record 43(1), 64–76 (2014)

    Article  Google Scholar 

  50. Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In: SIGMOD, pp. 85–96 (2012)

  51. Wang, L., von Laszewski, G., Younge, A.J., He, X., Kunze, M., Tao, J., Fu, C.: Cloud computing: a perspective study. New Gener. Comput. 28(2), 137–146 (2010)

    Article  Google Scholar 

  52. Wang, P., Xiao, C., Qin, J., Wang, W., Zhang, X., Ishikawa, Y.: Local similarity search for unstructured text. In: SIGMOD, pp. 1991–2005 (2016)

  53. Wang, X., Qin, L., Lin, X., Zhang, Y., Chang, L.: Leveraging set relations in exact set similarity join. PVLDB 10(9), 925–936 (2017)

    Google Scholar 

  54. Wang, X., Qin, L., Lin, X., Zhang, Y., Chang, L.: Leveraging set relations in exact and dynamic set similarity join. VLDB J. 28(2), 267–292 (2019)

    Article  Google Scholar 

  55. Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)

    MathSciNet  Google Scholar 

  56. Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: ICDE, pp. 916–927 (2009)

  57. Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: WWW, pp. 131–140 (2008)

  58. Yang, J., Zhang, W., Yang, S., Zhang, Y., Lin, X.: Tt-join: efficient set containment join. In: ICDE, pp. 509–520 (2017)

  59. Yang, J., Zhang, W., Yang, S., Zhang, Y., Lin, X., Yuan, L.: Efficient set containment join. VLDB J. 27(4), 471–495 (2018)

    Article  Google Scholar 

  60. Yang, Y., Zhang, W., Zhang, Y., Lin, X., Wang, L.: Selectivity estimation on set containment search. In: DASFAA, pp. 330–349 (2019)

  61. Yu, M., Li, G., Deng, D., Feng, J.: String similarity search and join: a survey. Front. Comput. Sci. 10(3), 399–417 (2016)

    Article  Google Scholar 

  62. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Dong Deng or Shuo Shang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised due to update in co corresponding author.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yang, C., Deng, D., Shang, S. et al. Internal and external memory set containment join. The VLDB Journal 30, 447–470 (2021). https://doi.org/10.1007/s00778-020-00644-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-020-00644-3

Keywords

Navigation