
The impact of JavaScript on archivability

Abstract

As web technologies evolve, web archivists work to adapt so that digital history is preserved. Recent advances in web technologies have introduced client-side executed scripts (Ajax) that, for example, load data without a change in the top-level Uniform Resource Identifier (URI) or require user interaction (e.g., content loaded via Ajax when the page is scrolled). These advances have made it more difficult to automate the capture of web pages. In an effort to understand why mementos (archived versions of live resources) in today’s archives vary in completeness and sometimes pull content from the live web, we present a study of web resources and archival tools. Our investigation used a collection of URIs shared over Twitter and a collection of URIs curated by Archive-It. We created local archived versions of the URIs in both sets using WebCite, wget, and the Heritrix crawler. We found that only 4.2% of the Twitter collection is perfectly archived by all of these tools, while 34.2% of the Archive-It collection is perfectly archived. After studying the quality of these mementos, we identified the practice of loading resources via JavaScript (Ajax) as the source of archival difficulty. Further, we show that resources are increasing their use of JavaScript to load embedded resources: by 2012, over half (54.5%) of pages used JavaScript to load embedded resources, and the number of embedded resources loaded via JavaScript increased by 12.0% from 2005 to 2012. We also show that JavaScript is responsible for 33.2% more missing resources in 2012 than in 2005, so it accounts for an increasing proportion of the embedded resources that mementos fail to load. Overall, JavaScript is responsible for 52.7% of all missing embedded resources in our study.
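To illustrate why resources loaded via JavaScript are hard to capture, the following minimal sketch parses a hypothetical page the way a simplified, script-unaware crawler might: it collects URIs from `src` and `href` attributes without executing any script. The page content, the URIs, and the names `EmbeddedResourceParser` and `static_embedded_uris` are assumptions made for illustration. The sketch does not model the actual behavior of WebCite, wget, or Heritrix.

```python
from html.parser import HTMLParser

# Hypothetical page: one image is declared in the markup, a second is
# only requested by a script after the page loads (Ajax-style).
PAGE = """
<html><body>
  <img src="/static/logo.png">
  <div id="feed"></div>
  <script>
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/api/feed?page=" + 1);
    xhr.onload = function () {
      var img = document.createElement("img");
      img.src = "/media/" + JSON.parse(xhr.responseText).photo;
      document.getElementById("feed").appendChild(img);
    };
    xhr.send();
  </script>
</body></html>
"""


class EmbeddedResourceParser(HTMLParser):
    """Collect URIs from src/href attributes without running any script."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.uris.append(value)


def static_embedded_uris(html):
    """Return the embedded-resource URIs visible in the static markup."""
    parser = EmbeddedResourceParser()
    parser.feed(html)
    return parser.uris


found = static_embedded_uris(PAGE)
print(found)
```

In this sketch the statically declared image is discovered, while the feed request and the image it would add are not. Those URIs only exist once a browser executes the script, which is the kind of embedded resource the study reports as missing from mementos.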



Notes

  1. http://phantomjs.org/
  2. https://www.webkit.org/
  3. https://dev.twitter.com/docs/streaming-apis/streams/public
  4. http://www.archive-it.org/explore/?show=Collections



Acknowledgments

This work was supported in part by the NSF (IIS 1009392) and the Library of Congress.

Author information

Correspondence to Justin F. Brunelle.


About this article


Cite this article

Brunelle, J.F., Kelly, M., Weigle, M.C. et al. The impact of JavaScript on archivability. Int J Digit Libr 17, 95–117 (2016). https://doi.org/10.1007/s00799-015-0140-8


Keywords

  • Web architecture
  • Web archiving
  • Digital preservation