International Journal on Digital Libraries, Volume 16, Issue 2, pp 129–144

Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive

Article

Abstract

When viewing an archived page using the archive’s user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, potentially drifting away from the datetime originally selected. For sparsely archived resources, this almost transparent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive’s Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to <30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
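
The difference between the two policies can be illustrated with a minimal sketch. It is not the authors' experimental code: the toy `archive` mapping, the `nearest` and `walk` helpers, and the definition of drift as the absolute distance in days between the selected capture and the originally chosen datetime are all assumptions made for illustration.

```python
from datetime import datetime

def nearest(captures, target):
    """Return the capture datetime closest to the target datetime."""
    return min(captures, key=lambda c: abs(c - target))

def walk(archive, path, start, policy):
    """Follow the URIs in `path` and return the drift (in days) at each step.

    policy: "sliding" moves the target to each selected capture's datetime;
            "sticky" keeps the originally selected datetime throughout.
    Drift is assumed here to be |selected capture - original datetime|.
    """
    target = start
    drift = []
    for uri in path:
        chosen = nearest(archive[uri], target)
        drift.append(abs((chosen - start).days))
        if policy == "sliding":
            target = chosen  # the target slides to the page just shown
    return drift

# Hypothetical, sparsely archived pages (URI -> capture datetimes).
archive = {
    "a": [datetime(2006, 1, 1)],
    "b": [datetime(2005, 3, 1), datetime(2006, 8, 1)],
    "c": [datetime(2005, 7, 1), datetime(2007, 6, 1)],
}
path = ["a", "b", "c"]          # an acyclic walk: no URI is revisited
start = datetime(2005, 6, 1)    # datetime originally selected by the user

sliding = walk(archive, path, start, "sliding")
sticky = walk(archive, path, start, "sticky")
print("sliding:", sliding)
print("sticky: ", sticky)
```

With this toy data, the sliding walk ends about two years from the selected datetime after three steps, while the sticky walk returns to within a month of it once a closer capture is available.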

Keywords

Digital preservation, HTTP, Resource versioning, Temporal applications, Web architecture, Web archiving

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Old Dominion University, Norfolk, USA
