An extended version of sectional MinHash method for near-duplicate detection

Shayegan, Mohammad-Javad; Faizollahi-Samarin, Mehdi

doi:10.1007/s11227-022-04447-x

An extended version of sectional MinHash method for near-duplicate detection

Published: 20 April 2022

Volume 78, pages 15638–15662, (2022)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

212 Accesses
1 Citation
Explore all metrics

Abstract

Identifying near-duplicate data can be applied to any type of content and has been widely used for increasing search engines' efficiency, detecting plagiarism or spam, etc. As a near-duplicate detection (NDD) method, sectional MinHash (S-MinHash) estimates the similarity between the text content in high accuracy by considering the section of every document's attributes with similarity estimation. However, due to the addition in computational complexity, it still has some performance issues such as being slow. The proposed sectional Min–Max Hash method aims to reduce the hashing time while preserving and improving the accuracy of detecting near-duplicate documents. We achieved this goal by combining S-MinHash with Min–Max Hash method. The results show that our new method reduces the hashing time and provides more speed due to using half of the random hash functions that S-MinHash needs to build up the signature matrix. Furthermore, we conducted experiments to compare our sectional Min–Max Hash with the baseline methods on the evaluated dataset and confirmed that in terms of the running time and algorithm's precision, the proposed method yields better results than the S-MinHash and other NDD techniques. Also, by assuming that we have two sections, as the best-case performance for sectional algorithms on the evaluated dataset, the error rate reduced significantly in the proposed method, and the F-score reached up to 99%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Near Duplicate Text Detection Using Frequency-Biased Signatures

Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

References

Fellah A (2021) All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection. Array 11:100070. https://doi.org/10.1016/j.array.2021.100070
Article Google Scholar
Wang Y, Zeng D, Zheng X, Wang F (2009) Propagation of online news: dynamic patterns. In: 2009 IEEE International Conference on Intelligence and Security Informatics, Richardson, Texas, USA. IEEE, pp 257–259. https://doi.org/10.1109/ISI.2009.5137321
Broder AZ, Eiron N, Fontoura M, Herscovici M, Lempel R, McPherson J, Qi R, Shekita E (2006) Indexing shared content in information retrieval systems. In: International Conference on Extending Database Technology, Munich, Germany. Springer, pp 313–330. https://doi.org/10.1007/11687238_21
Bernstein Y, Shokouhi M, Zobel J (2006) Compact features for detection of near-duplicates in distributed retrieval. In: International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, pp 110–121. https://doi.org/10.1007/11880561_10
Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Annual Symposium on Combinatorial Pattern Matching. Springer, pp 1–10. https://doi.org/10.1007/3-540-45123-4_1
Hassanian-esfahani R, Kargar M-j (2018) Sectional minhash for near-duplicate detection. Experts Syst Appl 99:203–212. https://doi.org/10.1016/j.eswa.2018.01.014
Article Google Scholar
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings. Compression and Complexity of SEQUENCES (Cat. No. 97TB100171). IEEE, pp 21–29. https://doi.org/10.1109/SEQUEN.1997.666900
Cohen E (2016) Min-Hash Sketches: A Brief Survey. Encyclopedia of Algorithms: 1282–1287.
Zhang C, Lin Y, Zhu L, Yuan X, Long J, Huang F (2019) Hierarchical one permutation hashing: efficient multimedia near duplicate detection. Multimed Tools Appl 78(21):30537–30560. https://doi.org/10.1007/s11042-018-6178-z
Article Google Scholar
Xiao Q, Wen L, Zhang Q (2021) Multi-resolution Odd Sketch for Mining Jaccard Similarities between Dynamic Streaming Sets. In: 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China. IEEE, pp 198–203. https://doi.org/10.1109/CSCWD49262.2021.9437641
Li P, Owen A, Zhang C-H (2012) One permutation hashing. In: 2012 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe. pp 3113–3121
Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman JD (2001) Finding interesting associations without support pruning. IEEE Trans Knowl Data Eng 13(1):64–78. https://doi.org/10.1109/69.908981
Article Google Scholar
Cohen E, Kaplan H (2007) Bottom-k sketches: Better and more efficient estimation of aggregates. In: ACM SIGMETRICS Performance Evaluation Review. vol 1. ACM, pp 353–354. https://doi.org/10.1145/1269899.1254926
Li P, König C (2010) b-Bit minwise hashing. In: Proceedings of the 2010 19th international conference on World wide web, Raleigh, North Carolina, USA. ACM, pp 671–680. https://doi.org/10.1145/1772690.1772759
Tang J, Tian Y, Liu D (2015) Connected bit minwise hashing for large-scale linear svm. In: 2015 IEEE 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, China. IEEE, pp 995–1002. https://doi.org/10.1109/FSKD.2015.7382079
Zhang W, Ji J, Zhu J, Li J, Xu H, Zhang B (2016) BitHash: An efficient bitwise Locality Sensitive Hashing method with applications. Knowl-Based Syst 97:40–47. https://doi.org/10.1016/j.knosys.2016.01.022
Article Google Scholar
Ji J, Li J, Yan S, Tian Q, Zhang B (2013) Min-max hash for jaccard similarity. In: 2013 IEEE 13th International Conference on Data Mining, Dallas, Texas, USA. IEEE, pp 301–309. https://doi.org/10.1109/ICDM.2013.119
Ertl OJapa (2017) Superminhash-A new minwise hashing algorithm for jaccard similarity estimation. arXiv preprint arXiv:1706.05698
Mitzenmacher M, Pagh R, Pham N (2014) Efficient estimation for high similarities using odd sketches. In: Proceedings of the 2014 23rd international conference on World wide web, Seoul, Korea. ACM, pp 109–118. https://doi.org/10.1145/2566486.2568017
Jia P, Wang P, Tao J, Guan X (2019) A fast sketch method for mining user similarities over fully dynamic graph streams. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China. IEEE, pp 1682–1685. https://doi.org/10.1109/ICDE.2019.00172
Wang P, Qi Y, Zhang Y, Zhai Q, Wang C, Lui JC, Guan X (2019) A memory-efficient sketch method for estimating high similarities in streaming sets. In: Proceedings of the 2019 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA. ACM, pp 25–33. https://doi.org/10.1145/3292500.3330825
Pamulaparty L, Rao CG, Rao MS (2015) XNDDF: Towards a framework for flexible near-duplicate document detection using supervised and unsupervised learning. Proc Comput Sci 48:228–235. https://doi.org/10.1016/j.procs.2015.04.175
Article Google Scholar
Akram J, Mumtaz M, Luo P (2020) IBFET: Index‐based features extraction technique for scalable code clone detection at file level granularity. Softw Pract Exp 50 (1):22–46. https://doi.org/10.1002/spe.2759
Yusof N, Ismail A, Majid NAA (2019) A hybrid model for near-duplicate image detection in mapreduce environment. TEM J 8 (4): 1252–1258. https://doi.org/10.14569/IJACSA.2019.0101248
HaCohen-Kerner Y, Tayeb A (2017) Rapid detection of similar peer-reviewed scientific papers via constant number of randomized fingerprints. Inf Process Manag 53(1):70–86. https://doi.org/10.1016/j.ipm.2016.06.007
Article Google Scholar

Download references

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Computer Engineering, University of Science and Culture, Bahar st, Shahid Qamushi st, Ashrafi Esfahani Bulvar, Tehran, Iran
Mohammad-Javad Shayegan & Mehdi Faizollahi-Samarin

Authors

Mohammad-Javad Shayegan
View author publications
You can also search for this author in PubMed Google Scholar
Mehdi Faizollahi-Samarin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Mohammad-Javad Shayegan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shayegan, MJ., Faizollahi-Samarin, M. An extended version of sectional MinHash method for near-duplicate detection. J Supercomput 78, 15638–15662 (2022). https://doi.org/10.1007/s11227-022-04447-x

Download citation

Accepted: 12 March 2022
Published: 20 April 2022
Issue Date: September 2022
DOI: https://doi.org/10.1007/s11227-022-04447-x

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An extended version of sectional MinHash method for near-duplicate detection

Abstract

Access this article

Similar content being viewed by others

Near Duplicate Text Detection Using Frequency-Biased Signatures

Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An extended version of sectional MinHash method for near-duplicate detection

Abstract

Access this article

Similar content being viewed by others

Near Duplicate Text Detection Using Frequency-Biased Signatures

Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation