International Journal on Digital Libraries, Volume 13, Issue 1, pp 33–49

Archiving the web using page changes patterns: a case study

Article

Abstract

A pattern is a model or template that summarizes and describes the behavior (or trend) of data that generally exhibit recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and applied to network monitoring, moving object tracking, financial and medical data analysis, scientific data processing, and other applications. In these contexts, the discovered patterns have been used to detect anomalies, to predict data behavior (or trend) or, more generally, to simplify data processing or improve system performance. However, to the best of our knowledge, patterns have never been used in the context of Web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools for archiving Websites efficiently. We first define our pattern model, which describes the importance of page changes. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve Web archives. The archive of the French public TV channels of France Télévisions is chosen as a case study to validate our approach. Our experimental evaluation, based on real Web pages, shows the utility of patterns for improving archive quality and optimizing indexing or storage.

Keywords

Web archiving · Importance of page changes · Pattern · Temporal completeness

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  1. LIP6, University Pierre and Marie Curie, Paris, France
