Abstract
Twenty years after the pioneering experiments performed by the Internet Archive and a few national libraries, web archiving has become a common activity of many scientific, cultural, and heritage institutions. They use a set of tools, generally open source, to identify, harvest, store, index, and make internet content available to end users, and to preserve it over the long term. Institutions seeking to preserve web archives nevertheless face major challenges: not only the huge amount of collected data, but also the lack of fully reliable metadata, which is crucial for understanding the web archives and informing future preservation actions on them. Web archives are generally stored in container formats, notably the ARC file format and its successor, the WARC format, which is an ISO standard. Context and Provenance information, generated prior to or as part of the harvesting process, is stored in these container formats, but other metadata, especially information on the formats of the collected files, may be generated afterwards. To store and archive these assets in digital repositories, it is necessary to record and manage their metadata. Institutions therefore need to make data and metadata modeling choices that are consistent not only with the design of their own repository and the kind and amount of data they have to preserve, but also with their conceptual view of the nature of web archives. This paper presents the choices and achievements of the National Library of France, an approach called “container modeling”. It then compares this approach with those of other members of the International Internet Preservation Consortium and with the projects of the New York Art Resources Consortium. It underlines how the different solutions are implemented with PREMIS and concludes with the use of format identification tools and metadata vocabularies for emulation strategies.
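As a rough illustration of the container formats mentioned above, the following Python sketch reads records from an uncompressed WARC stream and tallies the MIME type that each harvested response declares in its HTTP headers. It is not the method of any institution discussed in the chapter. The function names, the sample URIs, and the simplified sample records (which omit mandatory WARC fields such as the record identifier and date) are assumptions made for the example; real WARC files are usually compressed and are normally read with a dedicated library.

```python
import io
from collections import Counter


def read_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record block is exactly Content-Length bytes long.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def declared_mime_types(stream):
    """Tally the Content-Type declared by the server in each response record."""
    counts = Counter()
    for headers, block in read_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response block holds the HTTP status line, headers, then payload.
        http_head = block.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
        for http_line in http_head.split("\r\n")[1:]:
            name, _, value = http_line.partition(":")
            if name.strip().lower() == "content-type":
                counts[value.split(";")[0].strip().lower()] += 1
                break
    return counts


def make_record(warc_type, uri, block):
    """Build a simplified (not spec-complete) WARC record for the demo."""
    head = (
        f"WARC/1.0\r\nWARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(block)}\r\n\r\n"
    ).encode("utf-8")
    return head + block + b"\r\n\r\n"


# Hypothetical sample: one request and three responses.
sample = b"".join([
    make_record("request", "http://example.org/",
                b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"),
    make_record("response", "http://example.org/",
                b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8"
                b"\r\n\r\n<html></html>"),
    make_record("response", "http://example.org/about",
                b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>"),
    make_record("response", "http://example.org/logo.png",
                b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n\x89PNG"),
])

for mime, count in sorted(declared_mime_types(io.BytesIO(sample)).items()):
    print(mime, count)
```

The tally reflects only what servers declared at harvest time. Since the abstract notes that such metadata is not fully reliable, a count of this kind would at most complement the output of format identification tools run afterwards.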
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Oury, C., Blumenthal, KR., Peyrard, S. (2016). Digital Preservation Metadata Practice for Web Archives. In: Dappert, A., Guenther, R., Peyrard, S. (eds) Digital Preservation Metadata for Practitioners. Springer, Cham. https://doi.org/10.1007/978-3-319-43763-7_6
DOI: https://doi.org/10.1007/978-3-319-43763-7_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43761-3
Online ISBN: 978-3-319-43763-7
eBook Packages: Computer Science (R0)