The SHARC framework for data quality in Web archiving
Abstract
Web archives preserve the history of born-digital content and offer great potential for sociologists, business analysts, and legal experts on intellectual property and compliance issues. Data quality is crucial for these purposes. Ideally, crawlers should gather coherent captures of entire Web sites, but the politeness etiquette and completeness requirement mandate very slow, long-duration crawling while Web sites undergo changes. This paper presents the SHARC framework for assessing the data quality in Web archives and for tuning capturing strategies toward better quality with given resources. We define data quality measures, characterize their properties, and develop a suite of quality-conscious scheduling strategies for archive crawling. Our framework includes single-visit and visit–revisit crawls. Single-visit crawls download every page of a site exactly once in an order that aims to minimize the “blur” in capturing the site. Visit–revisit strategies revisit pages after their initial downloads to check for intermediate changes. The revisiting order aims to maximize the “coherence” of the site capture (the number of pages that did not change during the capture). The quality notions of blur and coherence are formalized in the paper. Blur is a stochastic notion that reflects the expected number of page changes that a time-travel access to a site capture would accidentally see, instead of the ideal view of an instantaneously captured, “sharp” site. Coherence is a deterministic quality measure that counts the number of unchanged and thus coherently captured pages in a site snapshot. Strategies that aim to either minimize blur or maximize coherence are based on prior knowledge of or predictions for the change rates of individual pages. Our framework includes fairly accurate classifiers for change predictions.
All strategies are fully implemented in a testbed and shown to be effective by experiments with both synthetically generated sites and a periodic crawl series for different Web sites.
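The two quality notions can be illustrated with a toy sketch. It is not the paper's formalization: it assumes each page changes at a constant average rate, that downloads are spaced by a fixed politeness delay, and that a time-travel access picks a reference time uniformly within the crawl interval. The names `Page`, `expected_blur`, and `coherence`, and the use of content hashes to detect changes, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    change_rate: float  # assumed average number of changes per time unit


def expected_blur(pages_in_order, delay=1.0):
    """Toy blur: expected number of changes separating each page's download
    time from a reference time drawn uniformly from the crawl interval,
    summed over all pages (an assumed model, not the paper's definition)."""
    n = len(pages_in_order)
    if n == 0:
        return 0.0
    duration = n * delay  # assumed crawl interval [0, n * delay]
    total = 0.0
    for slot, page in enumerate(pages_in_order):
        t = slot * delay  # download time of this page
        # Mean of |t - s| for s uniform on [0, duration].
        mean_gap = (t ** 2 + (duration - t) ** 2) / (2 * duration)
        total += page.change_rate * mean_gap
    return total


def coherence(visit_hashes, revisit_hashes):
    """Toy coherence: number of pages whose content hash is identical
    at the visit and at the revisit, i.e. unchanged during the capture."""
    return sum(
        1 for url, digest in visit_hashes.items()
        if revisit_hashes.get(url) == digest
    )


if __name__ == "__main__":
    hot = Page("/news", 1.0)
    cold = Page("/about", 0.0)
    print(expected_blur([hot, cold]), expected_blur([cold, hot]))
    print(coherence({"/news": "x", "/about": "y"},
                    {"/news": "z", "/about": "y"}))
```

In this toy model the download order alone changes the blur of a capture, which is why single-visit scheduling depends on known or predicted change rates, while coherence is a simple count that can be checked after the revisit phase.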
Keywords
Web archiving, Data quality, Blur, Coherence, Crawl strategies