Diversity Similarity Join for Big Data

Silva, Yasin N.; Martinez, Juan; Castro Cea, Pedro; Razente, Humberto; Nardini Barioni, Maria C.

doi:10.1007/978-3-031-46994-7_20

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14289)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

The Similarity Join (SJ) has become one of the most popular and valuable data processing operators for analyzing large amounts of data. Various types of similarity join operators have been used effectively in multiple scenarios. However, these operators usually generate large outputs containing many similar pairs that represent almost the same information. In previous work, a new operator called Diversity Similarity Join (DSJ) was proposed to address these issues. DSJ generates a smaller output with more meaningful and diverse result pairs. This operator, however, was proposed as a single-node operator, which critically limits its scalability. In this paper, we propose the Distributed Diversity Similarity Join (D2SJ) operator, an approach that enables SJ diversification on big datasets. We present the design guidelines and implementation details on Apache Spark, a popular big data processing framework. Our experimental results with real-world high-dimensional data show that the proposed operator has excellent performance and scalability properties.

Acknowledgments

This project was supported by an award from the Google Cloud Research program. The authors would like to thank Steven Hu, Timothy Raymer, and Steven Anderson for their contributions in the preliminary stages of this project.

Author information

Authors and Affiliations

Loyola University Chicago, Chicago, USA
Yasin N. Silva, Juan Martinez & Pedro Castro Cea
Universidade Federal de Uberlândia, Uberlândia, Brazil
Humberto Razente & Maria C. Nardini Barioni

Corresponding author

Correspondence to Yasin N. Silva.

Editor information

Editors and Affiliations

University of A Coruña, Coruña, Spain
Oscar Pedreira
Pompeu Fabra University, Barcelona, Spain
Vladimir Estivill-Castro

About this paper

Cite this paper

Silva, Y.N., Martinez, J., Castro Cea, P., Razente, H., Nardini Barioni, M.C. (2023). Diversity Similarity Join for Big Data. In: Pedreira, O., Estivill-Castro, V. (eds) Similarity Search and Applications. SISAP 2023. Lecture Notes in Computer Science, vol 14289. Springer, Cham. https://doi.org/10.1007/978-3-031-46994-7_20

DOI: https://doi.org/10.1007/978-3-031-46994-7_20
Published: 27 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46993-0
Online ISBN: 978-3-031-46994-7
eBook Packages: Computer Science, Computer Science (R0)