Abstract
Web archives are sources of big data. When presenting human visitors with archived web pages, or mementos, web archives often apply user interface augmentations to assist them. Unfortunately, these augmentations present challenges for natural language processing, computer vision, and machine learning methods. Thus, big data researchers must apply special techniques to web archives when acquiring mementos. This paper details these techniques so that future projects can more easily create datasets and conduct research. We review 22 web archives and discuss the methods needed to re-synthesize a memento to something close to its original capture without augmentations. We close by discussing options for improving the state of memento sharing for big data efforts.
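As a hedged illustration of what requesting a memento without augmentations can look like, the sketch below rewrites a Wayback-style URI-M so that it asks for the raw capture by placing the `id_` modifier after the 14-digit timestamp. This covers only archives that follow the Wayback URI layout and honor `id_`; other archives need different methods. The function name `raw_urim` and the example URI-M are assumptions made for illustration and do not come from the paper.

```python
import re

# Assumed Wayback-style layout:
#   <archive prefix>/<14-digit timestamp><optional 2-letter modifier + "_">/<original URI>
_URIM_PATTERN = re.compile(
    r"^(?P<prefix>.+?/)(?P<ts>\d{14})(?P<mod>[a-z]{2}_)?/(?P<urir>.+)$"
)


def raw_urim(urim: str) -> str:
    """Return a URI-M that requests the raw capture (hypothetical helper).

    Inserts the `id_` modifier after the timestamp, replacing any
    existing modifier such as `if_`. Raises ValueError when the input
    does not look like a Wayback-style URI-M.
    """
    match = _URIM_PATTERN.match(urim)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {urim!r}")
    return f"{match['prefix']}{match['ts']}id_/{match['urir']}"


if __name__ == "__main__":
    # Illustrative URI-M only; the timestamp is made up for the example.
    example = "https://web.archive.org/web/20171120000000/https://www.nytimes.com/"
    print(raw_urim(example))
```

The rewrite is pure string manipulation, so it can be applied to a list of URI-Ms before any download step. Whether a given archive actually returns unaugmented content for the rewritten URI-M still has to be checked archive by archive.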
Keywords
- Web archive collections
- WARCs
- data science
- big data
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jones, S.M., Jayanetti, H.R., Klein, M., Weigle, M.C., Nelson, M.L. (2023). Synthesizing Web Archive Collections into Big Data: Lessons from Mining Data from Web Archives. In: Alonso, O., Cousijn, H., Silvello, G., Marrero, M., Teixeira Lopes, C., Marchesin, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2023. Lecture Notes in Computer Science, vol 14241. Springer, Cham. https://doi.org/10.1007/978-3-031-43849-3_19
DOI: https://doi.org/10.1007/978-3-031-43849-3_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43848-6
Online ISBN: 978-3-031-43849-3
eBook Packages: Computer Science, Computer Science (R0)