International Conference on Availability, Reliability, and Security

CD-ARES 2012: Multidisciplinary Research and Practice for Information Systems, pp 203–217

Near Duplicate Document Detection for Large Information Flows

  • Daniele Montanari &
  • Piera Laura Puglisi

Part of the Lecture Notes in Computer Science book series (LNISA, volume 7465)

Abstract

Near duplicate documents and their detection are studied to identify information items that convey the same (or very similar) content, possibly surrounded by diverse side information such as metadata, advertisements, timestamps, web presentation, and navigation support. Identifying near duplicate information makes it possible to implement selection policies that optimize an information corpus and thereby improve its quality.

In this paper, we introduce a new method to find near duplicate documents based on q-grams extracted from the text. The algorithm exploits three major features: a similarity measure that compares document q-gram occurrences to evaluate the syntactic similarity of the compared texts; an indexing method that maintains an inverted index of q-grams; and an efficient allocation of the bitmaps, using a window size of 24 hours, that supports the document comparison process.
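The three features named above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact similarity measure or the bitmap layout, so the code assumes a multiset Jaccard measure over character q-gram counts, and plain Python sets stand in for the bitmaps. The names `qgrams`, `similarity`, and `WindowedIndex`, the default `q=3`, and the eviction strategy are all hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def qgrams(text: str, q: int = 3) -> Counter:
    """Count the overlapping character q-grams of a normalized text."""
    s = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard (an assumed measure): shared q-gram occurrences
    divided by the occurrences in the multiset union."""
    total = sum((a | b).values())
    return sum((a & b).values()) / total if total else 0.0


class WindowedIndex:
    """Inverted q-gram index over the documents seen in a time window."""

    def __init__(self, q=3, window=timedelta(hours=24), threshold=0.8):
        self.q, self.window, self.threshold = q, window, threshold
        self.docs = {}                    # doc_id -> (timestamp, q-gram counts)
        self.postings = defaultdict(set)  # q-gram -> ids of docs containing it

    def _expire(self, now: datetime) -> None:
        """Remove documents older than the window from the index."""
        old = [d for d, (ts, _) in self.docs.items() if now - ts > self.window]
        for doc_id in old:
            _, grams = self.docs.pop(doc_id)
            for g in grams:
                self.postings[g].discard(doc_id)
                if not self.postings[g]:
                    del self.postings[g]

    def add(self, doc_id: str, text: str, now: datetime) -> list:
        """Index a document and return the ids of in-window near duplicates."""
        self._expire(now)
        grams = qgrams(text, self.q)
        # Only documents sharing at least one q-gram can be similar.
        candidates = set()
        for g in grams:
            candidates |= self.postings.get(g, set())
        duplicates = sorted(
            c for c in candidates
            if similarity(grams, self.docs[c][1]) >= self.threshold
        )
        self.docs[doc_id] = (now, grams)
        for g in grams:
            self.postings[g].add(doc_id)
        return duplicates
```

In this sketch the inverted index limits comparisons to documents that share at least one q-gram with the incoming item, and the 24-hour window bounds how many documents the index holds at any time.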

The proposed algorithm has been tested in a multifeed news content management system, where it filters out duplicated news items coming from different information channels. The experimental evaluation shows the efficiency and the accuracy of our solution compared with other existing techniques. The results on a real dataset report an F-measure of 9.53 with a similarity threshold of 0.8.
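For readers unfamiliar with the metric, the F-measure is conventionally the weighted harmonic mean of precision and recall. The sketch below computes it for a set of predicted duplicate pairs against a ground-truth set. The function name `f_measure` and the example pairs are hypothetical and are not taken from the paper's dataset or evaluation code.

```python
def f_measure(predicted: set, actual: set, beta: float = 1.0) -> float:
    """F-measure of predicted duplicate pairs against ground-truth pairs."""
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # share of reported pairs that are real
    recall = true_pos / len(actual)        # share of real pairs that were reported
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up illustration data: 2 of 3 predicted pairs are correct, 2 of 4 real pairs found.
predicted = {("a", "b"), ("c", "d"), ("e", "f")}
actual = {("a", "b"), ("c", "d"), ("g", "h"), ("i", "j")}
score = f_measure(predicted, actual)  # precision 2/3, recall 1/2, F1 = 4/7
```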

Keywords

  • duplicate
  • information flows
  • q-grams

Author information

Authors and Affiliations

  1. ICT eni - Semantic Technologies, Via Arcoveggio 74/2, Bologna, 40129, Italy

    Daniele Montanari

  2. GESP, Via Marconi 71, Bologna, 40122, Italy

    Piera Laura Puglisi

Editor information

Editors and Affiliations

  1. Department of IT, Engineering and Environment, University of South Australia, Mawson Lakes Campus, 5001, Adelaide, SA, Australia

    Gerald Quirchmayr

  2. Department of Information Technologies, University of Economics, W. Churchill Sq. 4, 130 67, Prague 3, Czech Republic

    Josef Basl

  3. School of Information Science, Korean Bible University, 16 Danghyun 2-gil, Nowon-gu, 139-791, Seoul, Korea

    Ilsun You

  4. Information Technology and Decision Sciences, Old Dominion University, 2076 Constant Hall, 23529, Norfolk, VA, USA

    Lida Xu

  5. Institute of Software Technology and Interactive Systems, Vienna University of Technology and SBA Research, Favoritenstrasse 9-11, 1040, Vienna, Austria

    Edgar Weippl

Copyright information

© 2012 IFIP International Federation for Information Processing

About this paper

Cite this paper

Montanari, D., Puglisi, P.L. (2012). Near Duplicate Document Detection for Large Information Flows. In: Quirchmayr, G., Basl, J., You, I., Xu, L., Weippl, E. (eds) Multidisciplinary Research and Practice for Information Systems. CD-ARES 2012. Lecture Notes in Computer Science, vol 7465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32498-7_16

  • DOI: https://doi.org/10.1007/978-3-642-32498-7_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32497-0

  • Online ISBN: 978-3-642-32498-7

  • eBook Packages: Computer Science, Computer Science (R0)
