A pruning strategy to improve pairwise comparison-based near-duplicate detection

Hassanian-esfahani, Roya; Kargar, Mohammad-javad

doi:10.1007/s10115-018-1299-2

Regular Paper
Published: 03 January 2019

Knowledge and Information Systems, Volume 61, pages 931–963 (2019)

Abstract

Efficient and accurate near-duplicate detection is an active research topic. The main difficulty is the high time and space complexity of existing algorithms. This study proposes a novel pruning strategy to improve pairwise comparison-based near-duplicate detection methods. After parsing the documents into punctuation-delimited blocks called chunks, the strategy assigns each pair of documents to one of three categories, “near duplicate,” “non-duplicate” or “suspicious,” by applying certain filtering rules. This early decision makes it possible to skip many unnecessary computations, on average 92.95% of them. Then, for the suspicious pairs, common chunks and short chunks are removed and the remaining subsets are reserved for near-duplicate detection. The size of the remaining subsets is on average 4.42% of the original corpus size. Evaluation results show that near-duplicate detection with the proposed strategy in its best configuration (CHT = 8, τ = 0.1) achieves F-measure = 87.22% (precision = 86.91% and recall = 87.54%). Its F-measure is comparable with that of the SpotSigs method, with less execution time. In addition, applying the proposed strategy in a near-duplicate detection process eliminates the need for preprocessing. The strategy is also tunable to achieve the intended levels of near duplication and noise suppression.
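The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the filtering rules or define CHT, so the Jaccard-based rule, the `low` and `high` cut-offs, and the `min_words` parameter below are all hypothetical stand-ins chosen only to show the three-way decision and the removal of common and short chunks.

```python
import re

def chunks(text):
    """Split text into punctuation-delimited chunks (lowercased, whitespace-normalized)."""
    parts = re.split(r"[.,;:!?()\[\]]+", text.lower())
    return {" ".join(p.split()) for p in parts if p.strip()}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def classify_pair(doc_a, doc_b, low=0.1, high=0.9):
    """Hypothetical filtering rule: decide early from chunk overlap alone."""
    sim = jaccard(chunks(doc_a), chunks(doc_b))
    if sim >= high:
        return "near duplicate"
    if sim <= low:
        return "non-duplicate"
    return "suspicious"

def residual_chunks(doc_a, doc_b, min_words=3):
    """For a suspicious pair, drop chunks common to both documents and short chunks.

    min_words is an assumed length cut-off, not the paper's CHT definition.
    """
    ca, cb = chunks(doc_a), chunks(doc_b)
    common = ca & cb

    def keep(cs):
        return {c for c in cs - common if len(c.split()) >= min_words}

    return keep(ca), keep(cb)
```

Under these assumptions, only the pairs labelled "suspicious" and only their residual chunks would be passed on to a full near-duplicate detector, which is where the reported savings would come from.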

References

Abdel Hamid O, Behzadi B, Christoph S, Henzinger M (2009) Detecting the origin of text segments efficiently. In: Proceedings of the 18th international conference on World Wide Web
Alonso O, Fetterly D, Manasse M (2013) Duplicate news story detection revisited. In: Asia information retrieval symposium. Springer, Berlin, pp 203–214
Bernstein Y, Shokouhi M, Zobel J (2006) Compact features for detection of near-duplicates in distributed retrieval. In: International symposium on string processing and information retrieval. Springer, Berlin, pp 110–121
Bhimireddy M, Gandi KP, Hicks R, Veeramachaneni BR (2015) A survey to fix the threshold and implementation for detecting duplicate web documents. All Capstone Projects, Paper 155
Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 39–48
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the international conference on compression and complexity of sequences. IEEE, pp 21–29
Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Annual symposium on combinatorial pattern matching. Springer, Berlin, pp 1–10
Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the web. J Comput Netw ISDN Syst 29(8):1157–1166
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, pp 380–388
Chen Q, Zobel J, Verspoor K (2017) Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database 1:baw163. https://doi.org/10.1093/database/baw163
Chowdhury A, Frieder O, Grossman D, McCabe MC (2002) Collection statistics for fast duplicate document detection. ACM Trans Inf Syst (TOIS) 20(2):171–191
Clough PD (2003) Measuring text reuse. Department of Computer Science, University of Sheffield, Sheffield
Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R et al (2001) Finding interesting associations without support pruning. IEEE Trans Knowl Data Eng 13(1):64–78
Cohen E, Kaplan H (2007) Bottom-k sketches: better and more efficient estimation of aggregates. In: ACM SIGMETRICS performance evaluation
Conrad JG, Guo XS, Schriber CP (2003) Online duplicate document detection: signature reliability in a dynamic retrieval environment. In: Proceedings of the twelfth international conference on Information and knowledge management. ACM, pp 443–452
Cooper JW, Coden AR, Brown EW (2002) A novel method for detecting similar documents. In: Proceedings of the 35th annual Hawaii international conference on system sciences (HICSS 2002). IEEE, pp 1153–1159
Dobra A, Garofalakis M, Gehrke J, Rastogi R (2009) Multi-query optimization for sketch-based estimation. Inf Syst 34(2):209–230
Hajishirzi H, Yih W, Kolcz A (2010) Adaptive near-duplicate detection via similarity learning. In: Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, pp 419–426
Har-Peled S, Indyk P, Motwani R (2012) Approximate nearest neighbor: towards removing the curse of dimensionality. Theory Comput 8(1):321–350
Heintze N (1996) Scalable document fingerprinting. In: 1996 USENIX workshop on electronic commerce, vol 3
Hoad TC, Zobel J (2003) Methods for identifying versioned and plagiarized documents. J Am Soc Inf Sci Technol 54(3):203–215
Jaccard P (1901) Distribution de la Flore Alpine: dans le Bassin des dranses et dans quelques régions voisines. Rouge
Seo J, Croft WB (2008) Local text reuse detection. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 571–578. http://dl.acm.org/citation.cfm?id=1390432
Ji J, Li J, Yan S, Tian Q, Zhang B (2013) Min-max hash for Jaccard similarity. In: The 13th international conference on data mining (ICDM). IEEE, pp 301–309
Kołcz A, Chowdhury A (2008) Lexicon randomization for near-duplicate detection with I-Match. J Supercomput 45(3):255–276
Kołcz A, Chowdhury A, Alspector J (2004) Improved robustness of signature-based near-replica detection via lexicon randomization. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 605–610
Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 497–506
Li P, König C (2010) b-Bit minwise hashing. In: The 19th international conference on World Wide Web (WWW’10). ACM Press, New York, p 671
Li P, Owen A, Zhang C-H (2012) One permutation hashing. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems (Proceeding of the neural information processing systems conference), pp 3113–3121
Lo GS, Dembele S (2015) Probabilistic, statistical and algorithmic aspects of the similarity of texts and application to Gospels comparison. arXiv preprint arXiv:1508.03772
Mitzenmacher M, Pagh R, Pham N (2014) Efficient estimation for high similarities using odd sketches. In: Proceedings of the 23rd international World Wide Web Conference Committee (IW3C2)
Montanari D, Puglisi PL (2012) Near duplicate document detection for large information flows. In: International conference on availability, p 16. http://link.springer.com/chapter/10.1007/978-3-642-32498-7_16
Pamulaparty L, Rao CVG, Rao MS (2014) A near-duplicate detection algorithm to facilitate document clustering. Int J Data Min Knowl Manag Process 4(6):39
Sarawagi S, Kirpal A (2004) Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data. ACM, pp 743–754
Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, pp 76–85
Sun Y, Qin J, Wang W (2013) Near duplicate text detection using frequency-biased signatures. WISE 1:277–291
Theobald M, Siddharth J, Paepcke A (2008) Spotsigs: robust and efficient near duplicate detection in large web collections. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 563–570
Van Bezu R, Borst S, Rijkse R, Verhagen J (2015) Multi-component similarity method for web product duplicate detection. In: Proceedings of the 30th annual ACM symposium on applied computing
Vaughan L (2014) Discovering business information from search engine query data. Int J Online Inf Rev 38(4):562–574
Wang J, Chang H (2014) Exploiting near-duplicate relations in organizing news archives. Int J Intell Syst 29(7):597–614
Wang Y, Zeng D, Zheng X, Wang F (2009) Propagation of online news: dynamic patterns. In: IEEE international conference on intelligence and security informatics, ISI’09. IEEE, pp 257–259
Xiao C, Wang W, Lin X, Yu JX, Wang G (2011) Efficient similarity joins for near-duplicate detection. ACM Trans Database Syst (TODS) 36(3):15
Zhang W, Ji J, Zhu J, Li J, Xu H, Zhang B (2016) BitHash: an efficient bitwise Locality Sensitive Hashing method with applications. Int J Knowl Based Syst 97:40–47

Author information

Authors and Affiliations

Research Institute for Information and Communications Technologies, Academic Center for Education, Culture and Research, Tehran, Iran
Roya Hassanian-esfahani
Department of Computer Engineering, University of Science and Culture, Tehran, Iran
Mohammad-javad Kargar

Corresponding author

Correspondence to Roya Hassanian-esfahani.

Appendices

Appendix 1

Distribution frequency of the number of chunks in the documents for different CHTs:

Appendix 2

See Table 5.

Table 5 Effectiveness evaluation by precision, recall and F-measure for different CHT and τ

Appendix 3

3.1 Setting the similarity threshold

Results of the experiments with the aim of finding the best value for the similarity threshold are provided below. In line with previous studies [32, 37], the results show that selecting τ = 0.6 (between the two turning points of 0.5 and 0.7) would lead to an acceptable performance (Fig. 15).
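A threshold search of this kind can be sketched as follows. This is only an illustration under stated assumptions: `scored_pairs`, `evaluate` and `best_threshold` are hypothetical names, the pairs in the usage note are toy values and not the paper's data, and a pair is assumed to be predicted as a near duplicate when its similarity is at least τ.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(scored_pairs, tau):
    """Return (precision, recall, F-measure) at threshold tau.

    scored_pairs: list of (similarity, is_true_near_duplicate) tuples.
    """
    tp = sum(1 for s, y in scored_pairs if s >= tau and y)
    fp = sum(1 for s, y in scored_pairs if s >= tau and not y)
    fn = sum(1 for s, y in scored_pairs if s < tau and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)

def best_threshold(scored_pairs, candidates):
    """Pick the candidate threshold with the highest F-measure."""
    return max(candidates, key=lambda t: evaluate(scored_pairs, t)[2])

# Toy, hand-made labelled pairs used only to exercise the functions.
toy_pairs = [(0.9, True), (0.8, True), (0.65, True),
             (0.55, False), (0.4, False), (0.2, False)]
chosen = best_threshold(toy_pairs, [0.5, 0.6, 0.7])
```

The same `f_measure` helper reproduces the figure in the abstract: with precision = 86.91 and recall = 87.54 it gives approximately 87.22.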

3.2 k-Shingling on m random Shingles

Several experiments were conducted with k ranging from 1 to 3 and m from 200 to 500. The results are provided in Fig. 16. The best F-measure is achieved with k = 1, for m = 200, 300, 400 or 500 alike; each of these settings yields F-measure = 82.11%.
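For readers unfamiliar with the setup, k-shingling on m random shingles can be sketched as below. The appendix does not say how the random shingles are drawn, so this sketch makes an assumption: up to m word-level k-shingles are sampled uniformly from each document with a fixed seed, and the samples are compared with Jaccard similarity. All function names are hypothetical.

```python
import random

def shingles(text, k):
    """Set of word-level k-shingles (k consecutive words) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def sample_random(shingle_set, m, seed=0):
    """Keep at most m shingles, chosen uniformly at random (assumed sampling scheme)."""
    items = sorted(shingle_set)  # sort so the result is reproducible for a given seed
    if len(items) <= m:
        return set(items)
    return set(random.Random(seed).sample(items, m))

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def random_shingle_similarity(doc_a, doc_b, k=1, m=200, seed=0):
    """Jaccard similarity over the sampled k-shingles of two documents."""
    a = sample_random(shingles(doc_a, k), m, seed)
    b = sample_random(shingles(doc_b, k), m, seed)
    return jaccard(a, b)
```

When a document has no more than m distinct shingles, the sample is the full shingle set and the result is the exact Jaccard similarity.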

3.3 k-Shingling on m most frequent Shingles

Several experiments were conducted with k ranging from 1 to 3 and m from 200 to 500. The results are provided in Fig. 17. The best F-measure is achieved with k = 1, for m = 200, 300, 400 or 500 alike; each of these settings yields F-measure = 66.86%.
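The frequency-based variant can be sketched in the same way. The appendix does not specify whether frequency is counted per document or over the corpus; the sketch below assumes per-document counts and keeps the m most frequent word-level k-shingles of each document before computing Jaccard similarity. Function names are hypothetical.

```python
from collections import Counter

def shingle_list(text, k):
    """All word-level k-shingles in order of appearance, repeats included."""
    words = text.lower().split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def top_m_shingles(text, k, m):
    """The m most frequent k-shingles of one document (assumed per-document counts)."""
    counts = Counter(shingle_list(text, k))
    return {s for s, _ in counts.most_common(m)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def frequent_shingle_similarity(doc_a, doc_b, k=1, m=200):
    """Jaccard similarity over each document's m most frequent k-shingles."""
    return jaccard(top_m_shingles(doc_a, k, m), top_m_shingles(doc_b, k, m))
```

With k = 1 the shingles are single words, so this reduces to comparing the most frequent terms of the two documents.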

About this article

Cite this article

Hassanian-esfahani, R., Kargar, Mj. A pruning strategy to improve pairwise comparison-based near-duplicate detection. Knowl Inf Syst 61, 931–963 (2019). https://doi.org/10.1007/s10115-018-1299-2

Received: 15 September 2017
Revised: 06 October 2018
Accepted: 28 November 2018
Published: 03 January 2019
Issue Date: 01 November 2019
DOI: https://doi.org/10.1007/s10115-018-1299-2
