Skip to main content
Log in

An extended version of sectional MinHash method for near-duplicate detection

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

Identifying near-duplicate data can be applied to any type of content and has been widely used for increasing search engines' efficiency, detecting plagiarism or spam, etc. As a near-duplicate detection (NDD) method, sectional MinHash (S-MinHash) estimates the similarity between the text content in high accuracy by considering the section of every document's attributes with similarity estimation. However, due to the addition in computational complexity, it still has some performance issues such as being slow. The proposed sectional Min–Max Hash method aims to reduce the hashing time while preserving and improving the accuracy of detecting near-duplicate documents. We achieved this goal by combining S-MinHash with Min–Max Hash method. The results show that our new method reduces the hashing time and provides more speed due to using half of the random hash functions that S-MinHash needs to build up the signature matrix. Furthermore, we conducted experiments to compare our sectional Min–Max Hash with the baseline methods on the evaluated dataset and confirmed that in terms of the running time and algorithm's precision, the proposed method yields better results than the S-MinHash and other NDD techniques. Also, by assuming that we have two sections, as the best-case performance for sectional algorithms on the evaluated dataset, the error rate reduced significantly in the proposed method, and the F-score reached up to 99%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Similar content being viewed by others

References

  1. Fellah A (2021) All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection. Array 11:100070. https://doi.org/10.1016/j.array.2021.100070

    Article  Google Scholar 

  2. Wang Y, Zeng D, Zheng X, Wang F (2009) Propagation of online news: dynamic patterns. In: 2009 IEEE International Conference on Intelligence and Security Informatics, Richardson, Texas, USA. IEEE, pp 257–259. https://doi.org/10.1109/ISI.2009.5137321

  3. Broder AZ, Eiron N, Fontoura M, Herscovici M, Lempel R, McPherson J, Qi R, Shekita E (2006) Indexing shared content in information retrieval systems. In: International Conference on Extending Database Technology, Munich, Germany. Springer, pp 313–330. https://doi.org/10.1007/11687238_21

  4. Bernstein Y, Shokouhi M, Zobel J (2006) Compact features for detection of near-duplicates in distributed retrieval. In: International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, pp 110–121. https://doi.org/10.1007/11880561_10

  5. Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Annual Symposium on Combinatorial Pattern Matching. Springer, pp 1–10. https://doi.org/10.1007/3-540-45123-4_1

  6. Hassanian-esfahani R, Kargar M-j (2018) Sectional minhash for near-duplicate detection. Experts Syst Appl 99:203–212. https://doi.org/10.1016/j.eswa.2018.01.014

    Article  Google Scholar 

  7. Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings. Compression and Complexity of SEQUENCES (Cat. No. 97TB100171). IEEE, pp 21–29. https://doi.org/10.1109/SEQUEN.1997.666900

  8. Cohen E (2016) Min-Hash Sketches: A Brief Survey. Encyclopedia of Algorithms: 1282–1287.

  9. Zhang C, Lin Y, Zhu L, Yuan X, Long J, Huang F (2019) Hierarchical one permutation hashing: efficient multimedia near duplicate detection. Multimed Tools Appl 78(21):30537–30560. https://doi.org/10.1007/s11042-018-6178-z

    Article  Google Scholar 

  10. Xiao Q, Wen L, Zhang Q (2021) Multi-resolution Odd Sketch for Mining Jaccard Similarities between Dynamic Streaming Sets. In: 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China. IEEE, pp 198–203. https://doi.org/10.1109/CSCWD49262.2021.9437641

  11. Li P, Owen A, Zhang C-H (2012) One permutation hashing. In: 2012 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe. pp 3113–3121

  12. Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman JD (2001) Finding interesting associations without support pruning. IEEE Trans Knowl Data Eng 13(1):64–78. https://doi.org/10.1109/69.908981

    Article  Google Scholar 

  13. Cohen E, Kaplan H (2007) Bottom-k sketches: Better and more efficient estimation of aggregates. In: ACM SIGMETRICS Performance Evaluation Review. vol 1. ACM, pp 353–354. https://doi.org/10.1145/1269899.1254926

  14. Li P, König C (2010) b-Bit minwise hashing. In: Proceedings of the 2010 19th international conference on World wide web, Raleigh, North Carolina, USA. ACM, pp 671–680. https://doi.org/10.1145/1772690.1772759

  15. Tang J, Tian Y, Liu D (2015) Connected bit minwise hashing for large-scale linear svm. In: 2015 IEEE 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, China. IEEE, pp 995–1002. https://doi.org/10.1109/FSKD.2015.7382079

  16. Zhang W, Ji J, Zhu J, Li J, Xu H, Zhang B (2016) BitHash: An efficient bitwise Locality Sensitive Hashing method with applications. Knowl-Based Syst 97:40–47. https://doi.org/10.1016/j.knosys.2016.01.022

    Article  Google Scholar 

  17. Ji J, Li J, Yan S, Tian Q, Zhang B (2013) Min-max hash for jaccard similarity. In: 2013 IEEE 13th International Conference on Data Mining, Dallas, Texas, USA. IEEE, pp 301–309. https://doi.org/10.1109/ICDM.2013.119

  18. Ertl OJapa (2017) Superminhash-A new minwise hashing algorithm for jaccard similarity estimation. arXiv preprint arXiv:1706.05698

  19. Mitzenmacher M, Pagh R, Pham N (2014) Efficient estimation for high similarities using odd sketches. In: Proceedings of the 2014 23rd international conference on World wide web, Seoul, Korea. ACM, pp 109–118. https://doi.org/10.1145/2566486.2568017

  20. Jia P, Wang P, Tao J, Guan X (2019) A fast sketch method for mining user similarities over fully dynamic graph streams. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China. IEEE, pp 1682–1685. https://doi.org/10.1109/ICDE.2019.00172

  21. Wang P, Qi Y, Zhang Y, Zhai Q, Wang C, Lui JC, Guan X (2019) A memory-efficient sketch method for estimating high similarities in streaming sets. In: Proceedings of the 2019 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA. ACM, pp 25–33. https://doi.org/10.1145/3292500.3330825

  22. Pamulaparty L, Rao CG, Rao MS (2015) XNDDF: Towards a framework for flexible near-duplicate document detection using supervised and unsupervised learning. Proc Comput Sci 48:228–235. https://doi.org/10.1016/j.procs.2015.04.175

    Article  Google Scholar 

  23. Akram J, Mumtaz M, Luo P (2020) IBFET: Index‐based features extraction technique for scalable code clone detection at file level granularity. Softw Pract Exp 50 (1):22–46. https://doi.org/10.1002/spe.2759

  24. Yusof N, Ismail A, Majid NAA (2019) A hybrid model for near-duplicate image detection in mapreduce environment. TEM J 8 (4): 1252–1258. https://doi.org/10.14569/IJACSA.2019.0101248

  25. HaCohen-Kerner Y, Tayeb A (2017) Rapid detection of similar peer-reviewed scientific papers via constant number of randomized fingerprints. Inf Process Manag 53(1):70–86. https://doi.org/10.1016/j.ipm.2016.06.007

    Article  Google Scholar 

Download references

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mohammad-Javad Shayegan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shayegan, MJ., Faizollahi-Samarin, M. An extended version of sectional MinHash method for near-duplicate detection. J Supercomput 78, 15638–15662 (2022). https://doi.org/10.1007/s11227-022-04447-x

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-022-04447-x

Keywords

Navigation