The VLDB Journal, Volume 20, Issue 2, pp 183–207

The SHARC framework for data quality in Web archiving

  • Dimitar Denev
  • Arturas Mazeika
  • Marc Spaniol
  • Gerhard Weikum
Special Issue Paper

Abstract

Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit–revisit crawls. Single-visit crawls download every page of a site exactly once in an order that aims to minimize the “blur” in capturing the site. Visit–revisit strategies revisit pages after their initial downloads to check for intermediate changes. The revisiting order aims to maximize the “coherence” of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, “sharp” site. Coherence is a deterministic quality measure that counts the number of unchanged and thus coherently captured pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of or predictions for the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions.
All strategies are fully implemented in a testbed and shown to be effective by experiments with both synthetically generated sites and a periodic crawl series for different Web sites.
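
The coherence count described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of SHA-256 digests, and the sample pages are assumptions made for illustration, and a page is treated as unchanged when its content is identical at the visit and at the revisit (a change that was later reverted would go unnoticed).

```python
import hashlib


def content_digest(body: bytes) -> str:
    # Hypothetical helper: fingerprint a page body so two captures can be compared.
    return hashlib.sha256(body).hexdigest()


def coherence(visit: dict[str, bytes], revisit: dict[str, bytes]) -> int:
    """Count pages whose content is identical at the visit and at the revisit."""
    return sum(
        1
        for url, body in visit.items()
        if url in revisit and content_digest(body) == content_digest(revisit[url])
    )


# Illustrative captures of a three-page site; "/news" changed between the two passes.
visit = {"/": b"home v1", "/news": b"news v1", "/about": b"about v1"}
revisit = {"/": b"home v1", "/news": b"news v2", "/about": b"about v1"}

print(coherence(visit, revisit))  # prints 2
```

In this toy capture two of the three pages are coherently captured. A revisiting order that aims to maximize coherence would try to schedule revisits so that as many pages as possible fall into this unchanged set.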

Keywords

Web archiving, Data quality, Blur, Coherence, Crawl strategies

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Dimitar Denev (1)
  • Arturas Mazeika (1)
  • Marc Spaniol (1)
  • Gerhard Weikum (1)
  1. Max Planck Institute for Informatics, Saarbrücken, Germany
