Improving the Quality of Web Archives through the Importance of Changes

  • Myriam Ben Saad
  • Stéphane Gançarski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6860)


Due to the growing importance of the Web, several archiving institutes (national libraries, Internet Archive, etc.) are harvesting sites to preserve (a part of) the Web for future generations. A major issue encountered by archivists is preserving the quality of web archives. One way of assessing the quality of an archive is to quantify its completeness and the coherence of its page versions. Because of the large number of pages to be captured and the limited resources (storage space, bandwidth, etc.), it is impossible to have a complete archive (containing all the versions of all the pages). It is also impossible to ensure the coherence of all captured versions, because pages change very frequently during the crawl of a site. Nonetheless, it is possible to maximize the quality of archives by adjusting the web crawler's strategy. Our idea is (i) to improve the completeness of the archive by downloading the most important versions and (ii) to keep the most important versions as coherent as possible. Moreover, we introduce a pattern model which describes how the importance of page changes behaves over time. Based on patterns, we propose a crawl strategy to improve both the completeness and the coherence of web archives. Experiments based on real patterns show the usefulness and effectiveness of our approach.


Keywords: Web Archiving, Data Quality, Change Importance, Pattern





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Myriam Ben Saad
    • 1
  • Stéphane Gançarski
    • 1
  1. LIP6, University P. and M. Curie, Paris, France
