Exploiting Sentence-Level Features for Near-Duplicate Document Detection

  • Jenq-Haur Wang
  • Hung-Chi Chang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5839)

Abstract

Digital documents are easy to copy, so detecting possible near-duplicate copies effectively is critical in Web search. Conventional copy detection approaches, such as document fingerprinting and bag-of-words similarity, target different levels of granularity in document features, from word n-grams to whole documents. In this paper, we focus on the mutual-inclusive type of near-duplicates, in which documents are similar only because they partially overlap. We propose using a simple and compact sentence-level feature, the sequence of sentence lengths, for near-duplicate copy detection. Various configurations of sentence-level and word-level algorithms are evaluated. The experimental results show that sentence-level algorithms achieve higher efficiency with comparable precision and recall rates.
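To make the sentence-length feature concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify how sentences are split, how length is measured, or how sequences are matched; here sentences are split on `.`, `!` and `?`, length is counted in whitespace-separated words, and partial overlap is scored by the longest contiguous run of equal lengths. The names `sentence_lengths`, `longest_common_run` and `containment` are hypothetical.

```python
import re


def sentence_lengths(text):
    """Return the word count of each non-empty sentence.

    Assumption: sentences end at '.', '!' or '?'; length is counted in words.
    """
    sentences = re.split(r"[.!?]+", text)
    return [len(s.split()) for s in sentences if s.strip()]


def longest_common_run(a, b):
    """Length of the longest contiguous run shared by sequences a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous element of a
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (x, y)
                best = max(best, cur[j])
        prev = cur
    return best


def containment(a, b):
    """Shared run length relative to the shorter sequence (0.0 if one is empty)."""
    shorter = min(len(a), len(b))
    return longest_common_run(a, b) / shorter if shorter else 0.0


doc_a = ("Digital documents are easy to copy. "
         "Detecting copies matters in Web search. "
         "We study sentence features.")
doc_b = ("Breaking news today. "
         "Digital documents are easy to copy. "
         "Detecting copies matters in Web search. "
         "More follows soon after that.")

seq_a = sentence_lengths(doc_a)  # [6, 6, 4]
seq_b = sentence_lengths(doc_b)  # [3, 6, 6, 5]
print(longest_common_run(seq_a, seq_b))  # 2
print(containment(seq_a, seq_b))
```

Each document is reduced to a short list of integers, which is what makes the feature compact. Because unrelated sentences can share the same length, a real system would likely need a minimum run length or a follow-up verification step; that threshold is left open here.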

Keywords

Near-duplicate, sentence-level, copy detection, mutual inclusive

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Jenq-Haur Wang (1)
  • Hung-Chi Chang (2)
  1. National Taipei University of Technology, Taiwan
  2. Academia Sinica, Taiwan
