Skip to main content

Streaming Set Similarity Joins

  • Conference paper
  • First Online:
Enterprise Information Systems (ICEIS 2020)

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 417))

Included in the following conference series:

  • 1316 Accesses

Abstract

We consider the problem of efficiently answering set similarity joins over streams. This problem is challenging both in terms of CPU cost, because similarity matching is computationally much more expensive than equality comparisons, and memory requirements, due to the unbounded nature of streams. This article presents SSTR, a novel similarity join algorithm for streams of sets. We adopt the concept of temporal similarity and exploit its properties to improve efficiency and reduce memory usage. Furthermore, we propose a sampling-based technique for ordering set elements that increases the pruning power of SSTR and, thus, reduce even further the number of similarity comparisons and memory consumption. We provide an extensive experimental study on several synthetic as well as real-world datasets. Our results show that the techniques we proposed significantly reduce memory consumption, improve scalability, and lead to substantial performance gains over the baseline approach.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 119.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 159.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    In our implementation, we avoid repeated calculations of candidate-specific thresholds and overlap bounds by storing them in the map M.

  2. 2.

    dblp.uni-trier.de/xml.

  3. 3.

    https://en.wikipedia.org/wiki/Wikipedia:Database_download.

  4. 4.

    https://developer.twitter.com/en/products/tweets.

References

  1. Abadi, D.J., et al.: The design of the borealis stream processing engine. In: Proceedings of the Conference on Innovative Data Systems Research, pp. 277–289 (2005)

    Google Scholar 

  2. Amagata, D., Hara, T., Xiao, C.: Dynamic Set kNN Self-Join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 818–829 (2019)

    Google Scholar 

  3. Anastasiu, D.C., Karypis, G.: L2AP: fast cosine similarity search with prefix L-2 norm bounds. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 784–795 (2014)

    Google Scholar 

  4. Baayen, R.H.: Word Frequency Distributions, Text, Speech and Language Technology, vol. 18. Kluwer Academic Publishers (2001)

    Google Scholar 

  5. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of the ACM Symposium on Principles of Database Systems, pp. 1–16 (2002)

    Google Scholar 

  6. Baumgartner, J.: Reddit May 2019 submissions. Harv. Dataverse (2019). https://doi.org/10.7910/DVN/JVI8CT

    Article  Google Scholar 

  7. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the International World Wide Web Conferences, pp. 131–140. ACM (2007)

    Google Scholar 

  8. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: Proceedings of the ACM SIGACT Symposium on Theory of Computing, pp. 327–336. ACM (1998)

    Google Scholar 

  9. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink\(^{\rm TM}\): stream and batch processing in a single engine. IEEE Data Eng. Bull. 38(4), 28–38 (2015)

    Google Scholar 

  10. do Carmo Oliveira, D.J., Borges, F.F., Ribeiro, L.A., Cuzzocrea, A.: Set similarity joins with complex expressions on distributed platforms. In: Proceedings of the Symposium on Advances in Databases and Information Systems, pp. 216–230 (2018)

    Google Scholar 

  11. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the IEEE International Conference on Data Engineering, p. 5. IEEE Computer Society (2006)

    Google Scholar 

  12. Christiani, T., Pagh, R., Sivertsen, J.: Scalable and robust set similarity join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 1240–1243. IEEE Computer Society (2018)

    Google Scholar 

  13. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)

    Article  Google Scholar 

  14. Deng, F., Rafiei, D.: Approximately detecting duplicates for streaming data using stable bloom filters. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 25–36 (2006)

    Google Scholar 

  15. Dutta, S., Narang, A., Bera, S.K.: Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams. Proc. VLDB Endow. 6(8), 589–600 (2013)

    Article  Google Scholar 

  16. Kraus, N., Carmel, D., Keidar, I.: Fishing in the stream: similarity search over endless data. In: bigdata, pp. 964–969 (2017)

    Google Scholar 

  17. Lian, X., Chen, L.: Efficient similarity join over multiple stream time series. IEEE Trans. Knowl. Data Eng. 21(11), 1544–1558 (2009)

    Article  Google Scholar 

  18. Lian, X., Chen, L.: Set similarity join on probabilistic data. Proc. VLDB Endow. 3(1), 650–659 (2010)

    Article  Google Scholar 

  19. Lian, X., Chen, L.: Similarity join processing on uncertain data streams. IEEE Trans. Knowl. Data Eng. 23(11), 1718–1734 (2011)

    Article  Google Scholar 

  20. Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. PVLDB 9(9), 636–647 (2016)

    Google Scholar 

  21. Metwally, A., Agrawal, D., El Abbadi, A.: Duplicate detection in click streams. In: Proceedings of the International World Wide Web Conferences, pp. 12–21 (2005)

    Google Scholar 

  22. Morales, G.D.F., Gionis, A.: Streaming similarity self-join. Proc. VLDB Endow. 9(10), 792–803 (2016)

    Article  Google Scholar 

  23. Pacífico, L., Ribeiro, L.A.: SSTR: set similarity join over stream data. In: International Conference on Enterprise Information Systems, pp. 52–60. SCITEPRESS (2020)

    Google Scholar 

  24. Quirino, R.D., Ribeiro-Júnior, S., Ribeiro, L.A., Martins, W.S.: fgssjoin: A GPU-based algorithm for set similarity joins. In: International Conference on Enterprise Information Systems, pp. 152–161. SCITEPRESS (2017)

    Google Scholar 

  25. Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SJClust: towards a framework for integrating similarity join algorithms and clustering. In: International Conference on Enterprise Information Systems, pp. 75–80. SCITEPRESS (2016)

    Google Scholar 

  26. Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SjClust: a framework for incorporating clustering into set similarity join algorithms. LNCS Trans. Large Scale Data Knowl. Center. Syst. 38, 89–118 (2018)

    Google Scholar 

  27. Ribeiro, L.A., Härder, T.: Generalizing prefix filtering to improve set similarity joins. Inf. Syst. 36(1), 62–78 (2011)

    Article  Google Scholar 

  28. Ribeiro, L.A., Schneider, N.C., de Souza Inácio, A., Wagner, H.M., von Wangenheim, A.: Bridging database applications and declarative similarity matching. J. Inf. Data Manage. 7(3), 217–232 (2016)

    Google Scholar 

  29. Ribeiro-Júnior, S., Quirino, R.D., Ribeiro, L.A., Martins, W.S.: Fast parallel set similarity joins on many-core architectures. J. Inf. Data Manage. 8(3), 255–270 (2017)

    Google Scholar 

  30. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 743–754 (2004)

    Google Scholar 

  31. Shen, Z., Cheema, M.A., Lin, X., Zhang, W., Wang, H.: A generic framework for top-k pairs and top-k objects queries over sliding windows. IEEE Trans. Knowl. Data Eng. 26(6), 1349–1366 (2014)

    Article  Google Scholar 

  32. Sidney, C.F., Mendes, D.S., Ribeiro, L.A., Härder, T.: Performance prediction for set similarity joins. In: Proceedings of the ACM Symposium on Applied Computing, pp. 967–972 (2015)

    Google Scholar 

  33. Stonebraker, M., Çetintemel, U., Zdonik, S.B.: The 8 requirements of real-time stream processing. SIGMOD Rec. 34(4), 42–47 (2005)

    Article  Google Scholar 

  34. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM (2010)

    Google Scholar 

  35. Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)

    Article  MathSciNet  Google Scholar 

  36. Wang, X., Qin, L., Lin, X., Zhang, Y., Chang, L.: Leveraging set relations in exact set similarity join. Proc. VLDB Endow. 10(9), 925–936 (2017)

    Article  Google Scholar 

  37. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Leonardo Andrade Ribeiro .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Pacífico, L., Ribeiro, L.A. (2021). Streaming Set Similarity Joins. In: Filipe, J., Śmiałek, M., Brodsky, A., Hammoudi, S. (eds) Enterprise Information Systems. ICEIS 2020. Lecture Notes in Business Information Processing, vol 417. Springer, Cham. https://doi.org/10.1007/978-3-030-75418-1_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-75418-1_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-75417-4

  • Online ISBN: 978-3-030-75418-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics