Temporal Shingling for Version Identification in Web Archives

  • Ralf Schenkel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5993)

Abstract

Building and preserving archives of the evolving Web has been an important problem in research. Given the huge volume of content that is added or updated daily, identifying the right versions of pages to store in the archive is an important building block of any large-scale archival system. This paper presents temporal shingling, an extension of the well-established shingling technique for measuring how similar two snapshots of a page are. This novel method considers the lifespan of shingles to differentiate between important updates that should be archived and transient changes that may be ignored. Extensive experiments demonstrate the tradeoff between archive size and version coverage, and show that the novel method yields better archive coverage at smaller sizes than existing techniques.
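The abstract does not spell out the algorithm, so the following is only a minimal sketch of the idea rather than the paper's method. It assumes that a shingle is a window of `w` consecutive words, that snapshot similarity is the Jaccard coefficient of shingle sets, and that a shingle's lifespan is the number of consecutive snapshots in which it appears. The names `shingles`, `jaccard`, `lifespan`, `stable_shingles`, and the parameter `min_lifespan` are hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, w: int = 3) -> Set[Shingle]:
    """Return the set of w-word shingles (contiguous word windows) of text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Texts shorter than one window yield a single (short) shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lifespan(snapshots: List[Set[Shingle]], index: int, shingle: Shingle) -> int:
    """Assumed definition: length of the run of consecutive snapshots
    around position index that all contain the shingle."""
    if shingle not in snapshots[index]:
        return 0
    start = index
    while start > 0 and shingle in snapshots[start - 1]:
        start -= 1
    end = index
    while end + 1 < len(snapshots) and shingle in snapshots[end + 1]:
        end += 1
    return end - start + 1


def stable_shingles(snapshots: List[Set[Shingle]], index: int,
                    min_lifespan: int) -> Set[Shingle]:
    """Shingles of snapshot index that live for at least min_lifespan snapshots."""
    return {s for s in snapshots[index]
            if lifespan(snapshots, index, s) >= min_lifespan}


# Four snapshots of a hypothetical page: the counter changes every time
# (transient), while the bridge status changes once and then persists.
texts = [
    "breaking news the bridge is open visitors today 101",
    "breaking news the bridge is open visitors today 102",
    "breaking news the bridge is closed for repairs visitors today 103",
    "breaking news the bridge is closed for repairs visitors today 104",
]
snaps = [shingles(t) for t in texts]

# Plain shingling sees a change between snapshots 0 and 1 (the counter).
plain_01 = jaccard(snaps[0], snaps[1])

# Ignoring shingles that live for fewer than 2 snapshots hides the counter
# but still exposes the persistent change between snapshots 1 and 2.
temporal_01 = jaccard(stable_shingles(snaps, 0, 2), stable_shingles(snaps, 1, 2))
temporal_12 = jaccard(stable_shingles(snaps, 1, 2), stable_shingles(snaps, 2, 2))
```

In this toy run, plain shingling reports a difference between the first two snapshots because of the visitor counter, whereas the lifespan filter treats them as identical and still registers the lasting "open" to "closed" update. An archiver following this sketch would store a new version only when the filtered similarity drops below some threshold; how that threshold and the minimum lifespan should be chosen is not stated in the abstract.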



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Ralf Schenkel (1)
  1. Saarland University, Saarbrücken, Germany
