International Journal on Digital Libraries, Volume 16, Issue 3–4, pp 283–301

Not all mementos are created equal: measuring the impact of missing resources

  • Justin F. Brunelle
  • Mat Kelly
  • Hany SalahEldeen
  • Michele C. Weigle
  • Michael L. Nelson

Abstract

Web archives do not always capture every resource on every page that they attempt to archive. As a result, archived pages are often missing a portion of their embedded resources. These embedded resources have varying historic, utility, and importance values. The proportion of missing embedded resources does not provide an accurate measure of their impact on the Web page; some embedded resources are more important to the utility of a page than others. We propose a method to measure the relative value of embedded resources and assign a damage rating to archived pages as a way to evaluate archival success. In this paper, we show that Web users' perceptions of damage are not accurately estimated by the proportion of missing embedded resources. In fact, the proportion of missing embedded resources is a less accurate estimate of resource damage than a random selection. We propose a damage rating algorithm that aligns more closely with Web user perception, improving overall agreement with users on memento damage by 17% and by 51% if the mementos have a damage rating delta > 0.30. We use our algorithm to measure damage in the Internet Archive, showing that it is getting better at mitigating damage over time (going from a damage rating of 0.16 in 1998 to 0.13 in 2013). However, we show that a greater number of important embedded resources (2.05 per memento on average) are missing over time. In contrast, the damage in WebCite is increasing over time (going from 0.375 in 2007 to 0.475 in 2014), while the proportion of missing embedded resources remains constant (13% of the resources are missing on average). Finally, we investigate the impact of JavaScript on the damage of the archives, showing that a crawler that can archive JavaScript-dependent representations will reduce memento damage by 13.5%.
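The distinction the abstract draws between the proportion of missing embedded resources and an importance-weighted damage rating can be illustrated with a minimal sketch. This is not the paper's algorithm: the EmbeddedResource type, the two function names, and the weight values below are all hypothetical and chosen only to show how the two measures can disagree on the same memento.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedResource:
    uri: str
    weight: float   # hypothetical importance weight (assumption, not from the paper)
    missing: bool   # True if the archive failed to capture this resource


def proportion_missing(resources):
    """Unweighted baseline: fraction of embedded resources that are missing."""
    if not resources:
        return 0.0
    return sum(r.missing for r in resources) / len(resources)


def weighted_damage(resources):
    """Illustrative weighted rating: missing weight divided by total weight."""
    total = sum(r.weight for r in resources)
    if total == 0:
        return 0.0
    return sum(r.weight for r in resources if r.missing) / total


# A made-up memento: one important resource is missing, three minor ones survive.
page = [
    EmbeddedResource("style.css", 0.5, True),
    EmbeddedResource("logo.png", 0.25, False),
    EmbeddedResource("tracker.gif", 0.125, False),
    EmbeddedResource("footer-icon.png", 0.125, False),
]

print(proportion_missing(page))  # 0.25
print(weighted_damage(page))     # 0.5
```

In this invented example only one of four resources is missing, yet it carries half of the assumed importance, so the weighted rating is twice the unweighted proportion.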

Keywords

Web architecture, Web archiving, Digital preservation, Memento damage


Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Justin F. Brunelle (1)
  • Mat Kelly (1)
  • Hany SalahEldeen (1)
  • Michele C. Weigle (1)
  • Michael L. Nelson (1)
  1. Department of Computer Science, Old Dominion University, Norfolk, USA
