Efficient Approach for Near Duplicate Document Detection Using Textual and Conceptual Based Techniques

Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 27)

Abstract

With the rapid growth and use of the World Wide Web, the number of duplicate web pages has become very large. For a search engine to return results free from duplicates, those duplicates must be detected and eliminated. The proposed approach combines the strengths of state-of-the-art duplicate detection algorithms such as Shingling and Simhash to efficiently detect and eliminate near-duplicate web pages while taking into account important factors such as word order. In addition, it employs Latent Semantic Indexing (LSI) to detect conceptually similar documents, which textual duplicate detection techniques such as Shingling and Simhash often miss. The approach uses Hamming distance for textual duplicate detection and cosine similarity for conceptual duplicate detection as the similarity measures between two documents. For performance measurement, the F-measure of the proposed approach is compared with that of the traditional Simhash technique. Experimental results show that the proposed approach can outperform traditional Simhash.
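
The following is a minimal Python sketch of the building blocks named in the abstract, not the authors' implementation. It assumes word-level shingles of length 3, a 64-bit Simhash built from MD5 digests, and unweighted features; all function names, the shingle length, and the hash choice are illustrative assumptions. Hashing shingles rather than single words is one simple way to make the fingerprint sensitive to word order. The LSI step is not shown; cosine_similarity only illustrates the measure that would be applied to the reduced document vectors.

```python
import hashlib
import math


def shingles(text, k=3):
    """Return the set of word-level k-shingles (runs of k consecutive words)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(features, bits=64):
    """Compute a Simhash fingerprint over string features (assumes bits <= 64)."""
    counts = [0] * bits
    for feature in features:
        # Take the first 8 bytes of the MD5 digest as a 64-bit feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox leaps over the lazy dog"
    fp_a = simhash(shingles(doc_a))
    fp_b = simhash(shingles(doc_b))
    # A small distance relative to 64 bits suggests a textual near duplicate.
    print(hamming_distance(fp_a, fp_b))
```

In a pipeline of this kind, two documents would be flagged as textual near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold, and as conceptual duplicates when the cosine similarity of their LSI vectors exceeds one. Both thresholds are tuning parameters that the abstract does not specify.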

Keywords

F-measure, LSI, Shingling, Simhash, TF-IDF



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Rajendra Kumar Roul (1)
  • Sahil Mittal (1)
  • Pravin Joshi (1)
  1. BITS Pilani K. K. Birla Goa Campus, Goa, India
