Synthesizing Web Archive Collections into Big Data: Lessons from Mining Data from Web Archives

Part of the Lecture Notes in Computer Science book series (LNCS, volume 14241)

Abstract

Web archives are sources of big data. When presenting human visitors with archived web pages, or mementos, web archives often apply user interface augmentations to assist them. Unfortunately, these augmentations present challenges for natural language processing, computer vision, and machine learning methods. Thus, big data researchers must apply special techniques to web archives when acquiring mementos. This paper details these techniques so that future projects can more easily create datasets and conduct research. We review 22 web archives and discuss the methods needed to re-synthesize a memento to something close to its original capture without augmentations. We close by discussing options for improving the state of memento sharing for big data efforts.

Keywords

  • Web archive collections
  • WARCs
  • data science
  • big data

Notes

  1. https://webarchive.nla.gov.au/awa/20160511214903/http://pandora.nla.gov.au/pan/157302/20160512-0748/www.charroa.org.au/index.html

Author information

Corresponding author

Correspondence to Shawn M. Jones.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jones, S.M., Jayanetti, H.R., Klein, M., Weigle, M.C., Nelson, M.L. (2023). Synthesizing Web Archive Collections into Big Data: Lessons from Mining Data from Web Archives. In: Alonso, O., Cousijn, H., Silvello, G., Marrero, M., Teixeira Lopes, C., Marchesin, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2023. Lecture Notes in Computer Science, vol 14241. Springer, Cham. https://doi.org/10.1007/978-3-031-43849-3_19

  • DOI: https://doi.org/10.1007/978-3-031-43849-3_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43848-6

  • Online ISBN: 978-3-031-43849-3

  • eBook Packages: Computer Science, Computer Science (R0)