Digital Preservation Metadata Practice for Web Archives

  • Clément Oury
  • Karl-Rainer Blumenthal
  • Sébastien Peyrard
Chapter

Abstract

Twenty years after the pioneering experiments performed by the Internet Archive and a few national libraries, web archiving has become a common activity of many scientific, cultural, and heritage institutions. These institutions use a set of tools, generally open source, to identify, harvest, store, and index internet content, make it available to end users, and preserve it over the long term. Institutions seeking to preserve web archives nevertheless face major challenges: not only the huge amount of collected data, but also the lack of fully reliable metadata, which is crucial for understanding the web archives and informing future preservation actions on them. Web archives are generally stored in container formats, notably the ARC file format and its successor, the WARC format, which is an ISO standard. Context and Provenance information, generated before or during the harvesting process, is stored in these container formats, but other metadata, especially information on the formats of the collected files, may be generated afterwards. To store and archive these assets in digital repositories, it is necessary to record and manage their metadata. Institutions therefore need to make data and metadata modeling choices that are consistent not only with the design of their own repository and the kind and amount of data they have to preserve, but also with their conceptual view of the nature of web archives. This paper presents the choices and achievements of the National Library of France, whose approach is called “container modeling”. It then compares this approach to those of other members of the International Internet Preservation Consortium and to the projects of the New York Art Resources Consortium. It underlines how the different solutions are implemented with PREMIS and concludes with the use of format identification tools and metadata vocabularies for emulation strategies.

Keywords

National Library; Semantic Unit; Digital Preservation; Digital Repository; Provenance Information
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Clément Oury (1)
  • Karl-Rainer Blumenthal (2)
  • Sébastien Peyrard (3)

  1. International ISSN Centre, Paris, France
  2. New York Art Resources Consortium, New York, USA
  3. Bibliothèque nationale de France, Paris Cedex 13, France