A Maximum Dimension Partitioning Approach for Efficiently Finding All Similar Pairs

Koh, Jia-Ling; Peng, Shao-Chun

doi:10.1007/978-3-319-43946-4_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9829)

Included in the following conference series: International Conference on Big Data Analytics and Knowledge Discovery

Abstract

To solve the All Pair Similarity Search (APSS) problem efficiently, this paper proposes a maximum dimension partitioning approach that filters out non-similar pairs at an early stage. First, each data point is assigned to a segment of the data partition according to the dimension in which it has its maximum value. An adjusting method then balances the number of elements across the data segments. The similar pairs consist of inter-segment and intra-segment similar pairs, and most of the cost of solving APSS comes from finding the inter-segment similar pairs. To speed up this computation, a pilot-vector represents each segment and is used to estimate an upper bound on the similarity between each pair of segments. Only segment pairs whose upper bound of similarity is larger than the given similarity threshold need to generate inter-segment data pairs as candidates. Based on the proposed partitioning method, we also design a MapReduce framework that solves the APSS problem in parallel. The performance evaluation shows that the proposed method prunes non-similar data pairs more effectively than related work, and that the partition-based method fits the MapReduce programming scheme well, reducing the response time of solving the APSS problem.
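The pipeline described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not say how the pilot-vector or the upper bound is defined, so the sketch assumes non-negative, unit-length vectors, cosine similarity computed as a dot product, and a pilot-vector formed from the component-wise maximum of a segment's members. Under those assumptions the dot product of two pilot-vectors cannot be smaller than the similarity of any pair drawn from the two segments. The balancing step and the MapReduce framework are omitted, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def partition_by_max_dimension(points):
    """Group point indices by the dimension holding each point's maximum value."""
    segments = defaultdict(list)
    for idx, v in enumerate(points):
        max_dim = max(range(len(v)), key=lambda k: v[k])
        segments[max_dim].append(idx)
    return dict(segments)


def pilot_vector(points, members):
    """ASSUMPTION: the pilot-vector is the component-wise maximum of the segment."""
    dim = len(points[members[0]])
    return [max(points[i][k] for i in members) for k in range(dim)]


def similar_pairs(points, threshold):
    """Return index pairs (i, j), i < j, whose similarity is at least `threshold`.

    Assumes every vector is non-negative and unit length.
    """
    segments = partition_by_max_dimension(points)
    pilots = {d: pilot_vector(points, m) for d, m in segments.items()}
    result = set()

    # Intra-segment pairs: compared directly.
    for members in segments.values():
        for i, j in combinations(members, 2):
            if dot(points[i], points[j]) >= threshold:
                result.add((i, j))

    # Inter-segment pairs: skip a whole segment pair when the bound
    # derived from the pilot-vectors is below the threshold.
    for a, b in combinations(sorted(segments), 2):
        if dot(pilots[a], pilots[b]) < threshold:
            continue
        for i in segments[a]:
            for j in segments[b]:
                if dot(points[i], points[j]) >= threshold:
                    result.add((min(i, j), max(i, j)))
    return result


if __name__ == "__main__":
    raw = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2], [0, 0, 1]]
    data = [normalize(v) for v in raw]
    print(sorted(similar_pairs(data, 0.9)))
```

The bound used here is deliberately loose. A tighter bound would prune more segment pairs, which is where a method of this kind would be expected to gain most of its speed.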

Author information

Authors and Affiliations

Department of Information Science and Computer Engineering, National Taiwan Normal University, Taipei, 106, Taiwan, ROC
Jia-Ling Koh & Shao-Chun Peng

Corresponding author

Correspondence to Jia-Ling Koh.

Editor information

Editors and Affiliations

University of Science and Technology, Rolla, Missouri, USA
Sanjay Madria
Osaka University, Osaka, Japan
Takahiro Hara

About this paper

Cite this paper

Koh, JL., Peng, SC. (2016). A Maximum Dimension Partitioning Approach for Efficiently Finding All Similar Pairs. In: Madria, S., Hara, T. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2016. Lecture Notes in Computer Science, vol 9829. Springer, Cham. https://doi.org/10.1007/978-3-319-43946-4_11

DOI: https://doi.org/10.1007/978-3-319-43946-4_11
Published: 06 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43945-7
Online ISBN: 978-3-319-43946-4
eBook Packages: Computer Science, Computer Science (R0)
