Abstract
A set containment join operates on two set-valued attributes with a subset (\(\subseteq\)) relationship as the join condition. It has many real-world applications, such as publish/subscribe services and inclusion dependency discovery. Existing solutions can be broadly classified into union-oriented and intersection-oriented methods. According to several recent studies, union-oriented methods are not competitive because they involve an expensive subset enumeration step. Intersection-oriented methods build an inverted index on one attribute and intersect the inverted lists for the sets of the other attribute. Existing intersection-oriented methods intersect the inverted lists one by one. In contrast, in this paper we propose to intersect all the inverted lists simultaneously while skipping many irrelevant entries in the lists. To share computation, we utilize the prefix tree structure and extend our novel list intersection method to operate on the prefix tree. To further improve efficiency, we propose to partition the data and process each partition separately. Each partition is associated with a much smaller inverted index, so the set containment join cost can be significantly reduced. Moreover, to support large-scale datasets that exceed the available memory, we develop a novel adaptive data partition method designed to fully leverage the available memory and achieve high I/O efficiency, thereby delivering strong performance for external memory set containment join. We evaluate our methods using both real-world and synthetic datasets. Experimental results demonstrate that our method outperforms state-of-the-art methods by up to 10\(\times\) when the dataset resides entirely in memory. Furthermore, our approach achieves up to two orders of magnitude improvement in I/O efficiency compared with a baseline method when the dataset size exceeds the main memory space.
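To make the intersection-oriented idea concrete, the following is a minimal Python sketch under stated assumptions, not the algorithm proposed in the paper. It builds an inverted index over one collection (here called `S`) and, for each set `r` in the other collection `R`, intersects all of the inverted lists of `r`'s elements at once, using binary search to skip entries that cannot match. The function names (`build_inverted_index`, `intersect_all`, `containment_join`) and the use of positional set ids are hypothetical choices made for illustration; the prefix tree sharing and the data partitioning described above are not shown.

```python
from bisect import bisect_left
from collections import defaultdict


def build_inverted_index(sets):
    """Map each element to the sorted list of ids of the sets containing it."""
    index = defaultdict(list)
    for set_id, elements in enumerate(sets):
        for e in set(elements):
            index[e].append(set_id)  # ids arrive in increasing order
    return index


def intersect_all(lists):
    """Intersect several sorted id lists at once, skipping via binary search."""
    if not lists or any(not lst for lst in lists):
        return []
    positions = [0] * len(lists)
    result = []
    # No id smaller than the largest list head can appear in every list.
    candidate = max(lst[0] for lst in lists)
    while True:
        agreed = True
        for i, lst in enumerate(lists):
            # Jump to the first entry >= candidate, skipping everything before it.
            pos = bisect_left(lst, candidate, positions[i])
            if pos == len(lst):
                return result  # one list is exhausted: no further matches
            positions[i] = pos
            if lst[pos] != candidate:
                candidate = lst[pos]  # raise the candidate and restart the round
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1


def containment_join(R, S):
    """Return all pairs (r_id, s_id) such that R[r_id] is a subset of S[s_id]."""
    index = build_inverted_index(S)
    pairs = []
    for r_id, r in enumerate(R):
        elems = set(r)
        if not elems:
            # The empty set is contained in every set.
            pairs.extend((r_id, s_id) for s_id in range(len(S)))
            continue
        lists = [index.get(e, []) for e in elems]
        pairs.extend((r_id, s_id) for s_id in intersect_all(lists))
    return pairs
```

For example, with `R = [{1, 2}, {3}]` and `S = [{1, 2, 3}, {1, 2}, {3, 4}]`, `containment_join(R, S)` reports that `R[0]` is contained in `S[0]` and `S[1]`, and `R[1]` in `S[0]` and `S[2]`. Starting each round from the largest current list head is what lets the sketch pass over runs of ids that appear in only some of the lists.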
Change history
02 April 2021
A Correction to this paper has been published: https://doi.org/10.1007/s00778-021-00662-9
Notes
In our implementation, we use the element frequency order as the global order and use the most frequent element to partition the data.
In the experiment, we empirically set it to 3.
Additional information
The original online version of this article was revised to update the co-corresponding author.
Cite this article
Yang, C., Deng, D., Shang, S. et al. Internal and external memory set containment join. The VLDB Journal 30, 447–470 (2021). https://doi.org/10.1007/s00778-020-00644-3
DOI: https://doi.org/10.1007/s00778-020-00644-3