Archiving the web using page changes patterns: a case study

Abstract

A pattern is a model or template used to summarize and describe the behavior (or trend) of data that generally exhibit recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and used for applications such as network monitoring, moving object tracking, financial or medical data analysis, and scientific data processing. In these contexts, discovered patterns have proved useful to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools to archive Web sites efficiently. We first define our pattern model, which describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of the French public TV channels of France Télévisions is chosen as a case study to validate our approach. Our experimental evaluation, based on real Web pages, shows the utility of patterns to improve archive quality and to optimize indexing or storage.
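The three steps named in the abstract (extract the temporal evolution of changes, discover a pattern, exploit it) can be illustrated with a minimal Python sketch. It is not the authors' pattern model, which the abstract does not specify. It assumes that each detected change already carries a hypothetical importance score in [0, 1], that a pattern is simply the average importance per hour of day, and that the crawler has a fixed daily budget of visits. The function names and the sample observations are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hourly_pattern(observations):
    """Average change importance per hour of day.

    observations: iterable of (datetime, importance) pairs, where importance
    is a hypothetical score in [0, 1] for the change detected at that time.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, importance in observations:
        totals[timestamp.hour] += importance
        counts[timestamp.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


def best_crawl_hours(pattern, budget):
    """Pick the `budget` hours with the highest average importance.

    Ties are broken in favor of the earlier hour.
    """
    ranked = sorted(pattern, key=lambda hour: (-pattern[hour], hour))
    return sorted(ranked[:budget])


# Illustrative observations only (not data from the paper).
observations = [
    (datetime(2011, 3, 1, 8), 0.9),
    (datetime(2011, 3, 1, 13), 0.2),
    (datetime(2011, 3, 1, 20), 0.7),
    (datetime(2011, 3, 2, 8), 0.7),
    (datetime(2011, 3, 2, 13), 0.4),
    (datetime(2011, 3, 2, 20), 0.5),
]

pattern = hourly_pattern(observations)
crawl_hours = best_crawl_hours(pattern, budget=2)
print(crawl_hours)
```

With these made-up observations, the two visits are scheduled at the hours where changes were, on average, most important. A real archive crawler would need a finer-grained pattern and a principled importance measure.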

Author information

Correspondence to Myriam Ben Saad.

About this article

Cite this article

Saad, M.B., Gançarski, S. Archiving the web using page changes patterns: a case study. Int J Digit Libr 13, 33–49 (2012). https://doi.org/10.1007/s00799-012-0094-z

Keywords

  • Web archiving
  • Importance of page changes
  • Pattern
  • Temporal completeness