A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain

Bhavani, M.; Narayana, V. A.; Sreevani, Gaddameedi

doi:10.1007/978-981-15-7961-5_123

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 698)

Abstract

Web mining is the branch of data mining that deals with the enormous amount of data on the web. Search engines face many problems because of near-duplicate documents on the web, which lead to irrelevant answers and critically affect search engine performance and reliability. Two attempts at detecting near-duplicate web documents are found in the literature: the former considered the domain and size of the document, and the latter considered text and images as the search parameters. This article proposes a novel approach that combines the text, images, size, and domain of the document to detect near-duplicate documents. The approach extracts the keywords and images of the crawled document and compares them with those of existing documents to measure similarity. If the similarity score measure is less than 19.5 and the image comparison value is greater than 70%, the crawled document is detected as a near duplicate.
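
The decision rule stated at the end of the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function and constant names are hypothetical, the two thresholds are the only values taken from the abstract, and the abstract does not say how the similarity score or the image comparison value is computed, so both are simply passed in as numbers.

```python
# Thresholds quoted in the abstract; everything else here is an assumption.
SIMILARITY_SCORE_MAX = 19.5  # similarity score measure must be strictly below this
IMAGE_MATCH_MIN_PCT = 70.0   # image comparison value (percent) must be strictly above this


def is_near_duplicate(similarity_score: float, image_match_pct: float) -> bool:
    """Apply the two-threshold rule from the abstract.

    similarity_score: the text similarity score measure (lower means more alike,
        as implied by the "less than 19.5" condition).
    image_match_pct: the image comparison value, as a percentage from 0 to 100.
    Both inputs are assumed to be computed elsewhere (hypothetical upstream steps).
    """
    return (
        similarity_score < SIMILARITY_SCORE_MAX
        and image_match_pct > IMAGE_MATCH_MIN_PCT
    )


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    print(is_near_duplicate(12.0, 85.0))
    print(is_near_duplicate(25.0, 85.0))
```

Both conditions must hold together, so a document whose text is close to an existing one but whose images differ is not flagged under this rule, and vice versa.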

Author information

Authors and Affiliations

Dept of CSE, CMR College of Engineering and Technology, Hyderabad, India
M. Bhavani, V. A. Narayana & Gaddameedi Sreevani

Corresponding author

Correspondence to M. Bhavani.

Editor information

Editors and Affiliations

BioAxis DNA Research Centre (P) Ltd., Hyderabad, India
Amit Kumar
Dynexsys, Sydney, NSW, Australia
Stefan Mozar

Cite this paper

Bhavani, M., Narayana, V.A., Sreevani, G. (2021). A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain. In: Kumar, A., Mozar, S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_123

DOI: https://doi.org/10.1007/978-981-15-7961-5_123
Published: 12 October 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7960-8
Online ISBN: 978-981-15-7961-5
eBook Packages: Engineering, Engineering (R0)
