International Journal on Digital Libraries, Volume 16, Issue 3–4, pp 267–282

Bridging the gap between real world repositories and scalable preservation environments

  • Bolette Ammitzbøll Jurik
  • Asger Askov Blekinge
  • Rune Bruun Ferneke-Nielsen
  • Per Møldrup-Dalum
Abstract

Integrating large-scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved a daunting task. In this paper, we show how this integration can be achieved using software developed in the Scalable Preservation Environments (SCAPE) project, and how it can also be achieved using a more direct local implementation at the Danish State and University Library, inspired by the SCAPE project. Both approaches allow full use of the Hadoop system for massively distributed processing without causing excessive load on the repository. We present a proof-of-concept SCAPE integration and an in-production local integration, both based on repository systems at the Danish State and University Library and the Hadoop execution environment. Both use data from the Newspaper Digitisation Project, a collection that will grow to more than 32 million JP2 images. The use case for the SCAPE integration is feature extraction and validation of the JP2 images. The validation is done against an institutional preservation policy expressed in the machine-readable SCAPE Control Policy vocabulary. The feature extraction is done using the Jpylyzer tool. We perform an experiment with sets of JP2 images of various sizes to test the scalability and correctness of the solution. The first use case considered for the local Danish State and University Library integration is also feature extraction and validation of the JP2 images, this time using Jpylyzer and Schematron requirements translated by hand from the project specification. We further look at two other use cases: generation of histograms of the tonal distributions of the images, and generation of dissemination copies. We discuss the challenges and benefits of the two integration approaches when performing preservation actions on massive collections stored in traditional digital repositories.
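
The validation use case described above can be illustrated with a minimal sketch, assuming that features have already been extracted into one dictionary of properties per image. The policy entries, property names and file names below are hypothetical placeholders; they are not taken from the SCAPE Control Policy vocabulary, from Jpylyzer output, or from the library's JPEG 2000 specification. The map and reduce steps are plain Python functions that only mimic the shape of a Hadoop job.

```python
from collections import Counter

# Hypothetical policy: property name -> required value. These entries are
# illustrative assumptions, not the SCAPE vocabulary or the project's
# actual JPEG 2000 requirements.
POLICY = {"transformation": "5-3 reversible", "layers": 12}


def check_properties(properties, policy=POLICY):
    """Return a list of (name, expected, actual) policy violations."""
    return [
        (name, expected, properties.get(name))
        for name, expected in policy.items()
        if properties.get(name) != expected
    ]


def map_record(record):
    """Map step: turn one (path, properties) record into (status, path)."""
    path, properties = record
    status = "valid" if not check_properties(properties) else "invalid"
    return status, path


def reduce_counts(pairs):
    """Reduce step: count how many files ended up with each status."""
    return Counter(status for status, _ in pairs)


if __name__ == "__main__":
    # Hypothetical extracted properties for two images.
    records = [
        ("a.jp2", {"transformation": "5-3 reversible", "layers": 12}),
        ("b.jp2", {"transformation": "9-7 irreversible", "layers": 12}),
    ]
    print(reduce_counts(map(map_record, records)))
```

Keeping the per-file check separate from the aggregation means the same check could in principle run in many parallel tasks, with the repository only receiving the summarised results.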

Keywords

Digital preservation, Digital repository, Preservation action, Preservation policies, Scalability, Integration, File characterisation, JPEG 2000, Apache Hadoop

Notes

Acknowledgments

We would like to thank the following for their invaluable help in discussing and proofreading this paper: Bjarne Andersen, Karen Williams, Kåre Fiedler Christiansen and especially Jette Junge. We would also like to thank Tom Gravgaard Christensen and Jens Henrik Leonard Jensen for creating the Hadoop clusters, and Kim Teglgaard Christensen for log files and input on the Newspaper Digitisation Project. Last but certainly not least, our thanks go to all our colleagues in the SCAPE project for the great discussions on large-scale challenges and solutions, and specifically to Frank Asseg from FIZ Karlsruhe, Hélder Silva from Keep Solutions and Peter May from the British Library, who kindly helped us with the references.

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Bolette Ammitzbøll Jurik
    • 1
  • Asger Askov Blekinge
    • 1
  • Rune Bruun Ferneke-Nielsen
    • 1
  • Per Møldrup-Dalum
    • 1
  1. State and University Library, Aarhus C, Denmark