Abstract
Web mining is the branch of data mining that deals with the enormous amount of data on the web. Search engines face serious problems because of near-duplicate documents on the web, which lead to irrelevant answers and critically affect search engine performance and reliability. Two approaches to detecting near-duplicate web documents are found in the literature: the former considers the domain and size of the document, and the latter considers text and images as the search parameters. This article proposes a novel approach that combines text, images, size, and domain of the document to detect near-duplicate documents. The approach extracts the keywords and images of a crawled document and compares them with those of existing documents to measure similarity. If the similarity score measure is less than 19.5 and the image comparison value is greater than 70%, the document is detected as a near duplicate.
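The decision rule stated at the end of the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the two thresholds (19.5 and 70%) come from the abstract, while the names `Comparison`, `is_near_duplicate`, and `near_duplicates` are hypothetical. The abstract does not say how the similarity score or the image comparison value is computed, nor how domain and size enter the decision, so the sketch takes both values as precomputed inputs and leaves domain and size out.

```python
from dataclasses import dataclass

# Thresholds quoted in the abstract.
TEXT_SCORE_MAX = 19.5    # similarity score measure must be strictly below this
IMAGE_MATCH_MIN = 70.0   # image comparison value (percent) must be strictly above this


@dataclass
class Comparison:
    """Hypothetical record: one crawled document compared with one stored document."""
    doc_id: str                  # identifier of the stored document
    text_score: float            # similarity score measure (lower means more similar)
    image_match_percent: float   # image comparison value, in percent


def is_near_duplicate(text_score: float, image_match_percent: float) -> bool:
    """Apply the abstract's rule: both conditions must hold."""
    return text_score < TEXT_SCORE_MAX and image_match_percent > IMAGE_MATCH_MIN


def near_duplicates(comparisons: list[Comparison]) -> list[str]:
    """Return the ids of stored documents that the rule flags as near duplicates."""
    return [
        c.doc_id
        for c in comparisons
        if is_near_duplicate(c.text_score, c.image_match_percent)
    ]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    results = [
        Comparison("doc-a", text_score=12.0, image_match_percent=85.0),
        Comparison("doc-b", text_score=25.0, image_match_percent=90.0),
        Comparison("doc-c", text_score=10.0, image_match_percent=40.0),
    ]
    print(near_duplicates(results))
```

Both comparisons are strict, so a document that scores exactly 19.5 or matches exactly 70% of images is not flagged under this reading of the abstract.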
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bhavani, M., Narayana, V.A., Sreevani, G. (2021). A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain. In: Kumar, A., Mozar, S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_123
DOI: https://doi.org/10.1007/978-981-15-7961-5_123
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7960-8
Online ISBN: 978-981-15-7961-5
eBook Packages: Engineering, Engineering (R0)