Digital Preservation Metadata Practice for Web Archives

Abstract

Twenty years after the pioneering experiments performed by the Internet Archive and a few national libraries, web archiving has become a common activity of many scientific, cultural, and heritage institutions. These institutions use a set of tools, generally open source, to identify, harvest, store, index, make available to end users, and preserve internet content over the long term. Institutions seeking to preserve web archives nevertheless face major challenges: not only the huge amount of collected data, but also the lack of fully reliable metadata, which are crucial for understanding the web archives and informing future preservation actions on them. Web archives are generally stored in container formats, notably the ARC file format and its successor, the WARC format, which is an ISO standard. Context and provenance information, generated prior to or as part of the harvesting process, is stored in these container formats, but other metadata, especially information on the formats of the collected files, may be generated afterwards. To store and archive these assets in digital repositories, it is necessary to record and manage their metadata. Institutions therefore need to make data and metadata modeling choices, which should be consistent not only with the design of their own repository and the kind and amount of data they have to preserve, but also with their conceptual view of the nature of web archives. This paper presents the choices and achievements of the National Library of France, whose approach is called “container modeling”. It then compares this approach with those of other members of the International Internet Preservation Consortium and with the projects of the New York Art Resources Consortium. It underlines how the different solutions are implemented with PREMIS and concludes with the use of format identification tools and metadata vocabularies for emulation strategies.

Author information

Correspondence to Clément Oury.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Oury, C., Blumenthal, KR., Peyrard, S. (2016). Digital Preservation Metadata Practice for Web Archives. In: Dappert, A., Guenther, R., Peyrard, S. (eds) Digital Preservation Metadata for Practitioners. Springer, Cham. https://doi.org/10.1007/978-3-319-43763-7_6

  • DOI: https://doi.org/10.1007/978-3-319-43763-7_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43761-3

  • Online ISBN: 978-3-319-43763-7

  • eBook Packages: Computer Science (R0)
