International Journal on Digital Libraries, Volume 13, Issue 1, pp 33–49

Archiving the web using page changes patterns: a case study

Article

DOI: 10.1007/s00799-012-0094-z

Cite this article as:
Saad, M.B. & Gançarski, S. Int J Digit Libr (2012) 13: 33. doi:10.1007/s00799-012-0094-z

Abstract

A pattern is a model or template used to summarize and describe the behavior (or trend) of data that generally exhibit recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and used for applications such as network monitoring, moving object tracking, financial or medical data analysis, and scientific data processing. In these contexts, discovered patterns have been useful to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools for archiving Websites efficiently. We first define our pattern model, which describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of the French public TV channels of France Télévisions is chosen as a case study to validate our approach. Our experimental evaluation, based on real Web pages, shows the utility of patterns for improving archive quality and optimizing indexing or storage.

Keywords

Web archiving · Importance of page changes · Pattern · Temporal completeness

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  1. LIP6, University Pierre and Marie Curie, Paris, France