The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving

Klein, Martin; Shankar, Harihar; Balakireva, Lyudmila; Van de Sompel, Herbert

doi:10.1007/978-3-030-30760-8_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11799)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Web archiving frameworks are commonly assessed by the quality of their archival records and by their ability to operate at scale. The ubiquity of dynamic web content poses a significant challenge for crawler-based solutions, such as the Internet Archive, that are optimized for scale. Human-driven services such as the Webrecorder tool provide high-quality archival captures but are not optimized to operate at scale. We introduce the Memento Tracer framework, which aims to balance archival quality and scalability. We outline its concept and architecture and evaluate its archival quality and operation at scale. Our findings indicate that its archival quality is on par with or better than that of established archiving frameworks and that operation at scale comes with a manageable overhead.

Notes

1. http://timetravel.mementoweb.org/.
2. http://web.archive.org/web/20190417195948/https://www.cnn.com/.
3. https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/.
4. http://memento-damage.cs.odu.edu/.
5. https://www.seleniumhq.org/selenium-ide/.
6. Selenium WebDriver: https://www.seleniumhq.org/.
7. Headless Chrome: https://chromium.googlesource.com/chromium/src/+/lkgr/headless/README.md.
8. warcprox: https://github.com/internetarchive/warcprox.
9. For example: https://myresearch.institute/event/e7e8fcc4e8c14392af1c264295d6268a/.
10. https://myresearch.institute/about/.
11. https://www.slideshare.net/explore.
12. A screencast of the Memento Tracer Chrome extension and the interactions with a GitHub repository recorded into a trace is available at: https://doi.org/10.6084/m9.figshare.8049839.v1.
13. The trace is available at: https://doi.org/10.6084/m9.figshare.8024612.
14. The trace is available at: https://doi.org/10.6084/m9.figshare.8024615.
15. https://github.com/GoogleChrome/puppeteer.

References

ISO 28500:2017 - information and documentation - WARC file format. https://www.iso.org/standard/68004.html
United Kingdom National Archives: The National Archives. https://www.nationalarchives.gov.uk/
Berlin, J.: CNN.com Has Been Unarchivable Since November 1st, 2016. https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
Berlin, J.A.: To relive the web: a framework for the transformation and archival replay of web pages. Master of Science (MS), Thesis, Computer Science, Old Dominion University (2018)
Brunelle, J.F., Weigle, M.C., Nelson, M.L.: Archival crawlers and Javascript: discover more stuff but crawl more slowly. In: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 1–10 (2017)
Brunelle, J.F., Kelly, M., SalahEldeen, H., Weigle, M.C., Nelson, M.L.: Not all mementos are created equal: measuring the impact of missing resources. In: Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 321–330 (2014)
Hidayat, A.: PhantomJS. https://github.com/ariya/phantomjs
Internet Archive: Brozzler. https://github.com/internetarchive/brozzler
Internet Archive: Heritrix web crawler. https://github.com/internetarchive/heritrix3
Internet Archive: Wayback machine. http://web.archive.org/
Kahle, B.: Wayback rising!. https://twitter.com/brewster_kahle/status/1118172506777509890
Kreymer, I.: A prototype of automated web archiving, emulation and server preservation. https://blog.webrecorder.io/2018/08/28/automation-emulation-server-preserve.html
Kreymer, I.: Webrecorder. https://github.com/webrecorder/webrecorder
Kreymer, I.: Webrecorder player. https://github.com/webrecorder/webrecorder-player
National Library of Australia: Trove. https://trove.nla.gov.au/
Poursardar, F., Shipman, F.: How perceptions of web resource boundaries differ for institutional and personal archives. In: 2018 IEEE International Conference on Information Reuse and Integration (IRI), pp. 126–129 (2018)
Reich, V., Rosenthal, D.S.H.: LOCKSS: a permanent web publishing and access system. D-Lib Mag. 7(6) (2001)
Rosenthal, D.S.H., Vargas, D.L., Lipkis, T.A., Griffin, C.T.: Enhancing the LOCKSS digital preservation technology. D-Lib Mag. 21(9/10) (2015). https://doi.org/10.1045/september2015-rosenthal

Acknowledgement

This work is supported in part by The Andrew W. Mellon Foundation grant 11600663.

Author information

Authors and Affiliations

Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
Martin Klein, Harihar Shankar & Lyudmila Balakireva
Data Archiving and Networked Services, Anna van Saksenlaan 51, 2593 HW, The Hague, The Netherlands
Herbert Van de Sompel

Authors

Martin Klein
Harihar Shankar
Lyudmila Balakireva
Herbert Van de Sompel

Corresponding author

Correspondence to Martin Klein.

Editor information

Editors and Affiliations

University of La Rochelle, La Rochelle, France
Antoine Doucet
VU University Amsterdam, Amsterdam, The Netherlands
Antoine Isaac
Linnaeus University, Växjö, Sweden
Koraljka Golub
OsloMet – Oslo Metropolitan University, Oslo, Norway
Trond Aalberg
Kyoto University, Kyoto, Japan
Adam Jatowt

About this paper

Cite this paper

Klein, M., Shankar, H., Balakireva, L., Van de Sompel, H. (2019). The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. https://doi.org/10.1007/978-3-030-30760-8_15

DOI: https://doi.org/10.1007/978-3-030-30760-8_15
Published: 30 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30759-2
Online ISBN: 978-3-030-30760-8
eBook Packages: Computer Science, Computer Science (R0)
