Streaming Set Similarity Joins

Pacífico, Lucas; Ribeiro, Leonardo Andrade

doi:10.1007/978-3-030-75418-1_2

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 417)

Included in the following conference series:

International Conference on Enterprise Information Systems

Abstract

We consider the problem of efficiently answering set similarity joins over streams. This problem is challenging in terms of both CPU cost, because similarity matching is computationally much more expensive than equality comparison, and memory requirements, because streams are unbounded. This article presents SSTR, a novel similarity join algorithm for streams of sets. We adopt the concept of temporal similarity and exploit its properties to improve efficiency and reduce memory usage. Furthermore, we propose a sampling-based technique for ordering set elements that increases the pruning power of SSTR and thus further reduces the number of similarity comparisons and memory consumption. We provide an extensive experimental study on several synthetic and real-world datasets. Our results show that the proposed techniques significantly reduce memory consumption, improve scalability, and lead to substantial performance gains over the baseline approach.
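The abstract does not define temporal similarity, so the following is only a minimal sketch under stated assumptions, not the SSTR algorithm itself. It assumes a common formulation in which the Jaccard similarity of two sets is multiplied by an exponential decay of their arrival-time difference. Under that assumption, a pair can reach the threshold only if the two sets arrive within a bounded time horizon of each other, so older sets can be evicted from memory. The names `streaming_join`, `time_horizon`, `threshold`, and `decay` are hypothetical, and the sketch omits the prefix filtering and element ordering mentioned in the abstract.

```python
import math
from collections import deque


def jaccard(a, b):
    """Plain Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def time_horizon(threshold, decay):
    """Largest time gap at which two identical sets still reach the threshold.

    Assumes temporal similarity = jaccard * exp(-decay * gap); since
    jaccard <= 1, a pair can qualify only if exp(-decay * gap) >= threshold.
    """
    return math.log(1.0 / threshold) / decay


def streaming_join(stream, threshold, decay):
    """Naive streaming self-join over (timestamp, items) pairs.

    Assumes timestamps arrive in non-decreasing order. Returns a list of
    (earlier_ts, later_ts, temporal_similarity) for qualifying pairs.
    """
    horizon = time_horizon(threshold, decay)
    window = deque()  # sets that can still match future arrivals
    results = []
    for ts, items in stream:
        current = frozenset(items)
        # Evict sets too old to ever reach the threshold again.
        while window and ts - window[0][0] > horizon:
            window.popleft()
        # Compare the new set against every set still in memory.
        for old_ts, old_set in window:
            sim = jaccard(current, old_set) * math.exp(-decay * (ts - old_ts))
            if sim >= threshold:
                results.append((old_ts, ts, sim))
        window.append((ts, current))
    return results
```

In this sketch the memory footprint is bounded by the number of sets arriving within one time horizon, which illustrates why a time-dependent similarity notion can help with unbounded streams; an actual implementation would replace the inner loop with an index-based candidate search.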

Notes

1. In our implementation, we avoid repeated calculations of candidate-specific thresholds and overlap bounds by storing them in the map M.
2. dblp.uni-trier.de/xml.
3. https://en.wikipedia.org/wiki/Wikipedia:Database_download.
4. https://developer.twitter.com/en/products/tweets.

Author information

Authors and Affiliations

Instituto de Informática, Universidade Federal de Goiás, Goiânia, Brazil
Lucas Pacífico & Leonardo Andrade Ribeiro

Corresponding author

Correspondence to Leonardo Andrade Ribeiro.

Editor information

Editors and Affiliations

Polytechnic Institute of Setúbal/INSTICC, Setúbal, Portugal
Joaquim Filipe
Warsaw University of Technology, Warsaw, Poland
Michał Śmiałek
George Mason University, Fairfax, VA, USA
Alexander Brodsky
MODESTE/ESEO, Angers, France
Slimane Hammoudi

About this paper

Cite this paper

Pacífico, L., Ribeiro, L.A. (2021). Streaming Set Similarity Joins. In: Filipe, J., Śmiałek, M., Brodsky, A., Hammoudi, S. (eds) Enterprise Information Systems. ICEIS 2020. Lecture Notes in Business Information Processing, vol 417. Springer, Cham. https://doi.org/10.1007/978-3-030-75418-1_2

DOI: https://doi.org/10.1007/978-3-030-75418-1_2
Published: 01 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75417-4
Online ISBN: 978-3-030-75418-1
eBook Packages: Computer Science, Computer Science (R0)
