The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving

  • Conference paper
  • In: Digital Libraries for Open Knowledge (TPDL 2019)

Abstract

Web archiving frameworks are commonly assessed by the quality of their archival records and by their ability to operate at scale. The ubiquity of dynamic web content poses a significant challenge for crawler-based solutions, such as the Internet Archive, that are optimized for scale. Human-driven services, such as the Webrecorder tool, provide high-quality archival captures but are not optimized to operate at scale. We introduce the Memento Tracer framework, which aims to balance archival quality and scalability. We outline its concept and architecture and evaluate its archival quality and its operation at scale. Our findings indicate that its archival quality is on par with or better than that of established archiving frameworks and that operation at scale comes with a manageable overhead.

Notes

  1. http://timetravel.mementoweb.org/

  2. http://web.archive.org/web/20190417195948/https://www.cnn.com/

  3. https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/

  4. http://memento-damage.cs.odu.edu/

  5. https://www.seleniumhq.org/selenium-ide/

  6. Selenium WebDriver: https://www.seleniumhq.org/

  7. Headless Chrome: https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md

  8. warcprox: https://github.com/internetarchive/warcprox

  9. For example: https://myresearch.institute/event/e7e8fcc4e8c14392af1c264295d6268a/

  10. https://myresearch.institute/about/

  11. https://www.slideshare.net/explore

  12. A screencast of the Memento Tracer Chrome extension, showing interactions with a GitHub repository being recorded into a trace, is available at: https://doi.org/10.6084/m9.figshare.8049839.v1

  13. The trace is available at: https://doi.org/10.6084/m9.figshare.8024612

  14. The trace is available at: https://doi.org/10.6084/m9.figshare.8024615

  15. https://github.com/GoogleChrome/puppeteer

Acknowledgement

This work is supported in part by The Andrew W. Mellon Foundation grant 11600663.

Author information

Corresponding author

Correspondence to Martin Klein.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Klein, M., Shankar, H., Balakireva, L., Van de Sompel, H. (2019). The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. https://doi.org/10.1007/978-3-030-30760-8_15

  • DOI: https://doi.org/10.1007/978-3-030-30760-8_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30759-2

  • Online ISBN: 978-3-030-30760-8

  • eBook Packages: Computer Science, Computer Science (R0)
