Bridging the gap between real world repositories and scalable preservation environments

Abstract

Integrating large-scale processing environments, such as Hadoop, with traditional repository systems, such as Fedora Commons 3, has long proved to be a daunting task. In this paper, we show how this integration can be achieved using software developed in the Scalable Preservation Environments (SCAPE) project, and how it can also be achieved using a more direct local implementation at the Danish State and University Library, inspired by the SCAPE project. Both allow full use of the Hadoop system for massively distributed processing without causing excessive load on the repository. We present a proof-of-concept SCAPE integration and an in-production local integration, both based on repository systems at the Danish State and University Library and the Hadoop execution environment. Both use data from the Newspaper Digitisation Project, a collection that will grow to more than 32 million JP2 images. The use case for the SCAPE integration is feature extraction and validation of the JP2 images. The validation is done against an institutional preservation policy expressed in the machine-readable SCAPE Control Policy vocabulary. The feature extraction is done using the Jpylyzer tool. We perform an experiment with sets of JP2 images of various sizes to test the scalability and correctness of the solution. The first use case considered for the local Danish State and University Library integration is also feature extraction and validation of the JP2 images, this time using Jpylyzer and Schematron requirements translated by hand from the project specification. We further look at two other use cases: generation of histograms of the tonal distributions of the images, and generation of dissemination copies. We discuss the challenges and benefits of the two integration approaches when performing preservation actions on massive collections stored in traditional digital repositories.
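
To make the validation use case concrete, the following is a minimal sketch of checking extracted image features against hand-written requirements. It is not the SCAPE Control Policy vocabulary, the Schematron rules, or the Jpylyzer output format: the property names (`is_valid`, `quality_layers`, `decomposition_levels`), the thresholds, and the functions `validate` and `validate_batch` are all hypothetical and chosen only for illustration. The batch function stands in for a map step that reports only failing images.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """One hand-written requirement on an extracted image property."""
    prop: str                      # hypothetical property name
    check: Callable[[str], bool]   # returns True when the value is acceptable
    message: str                   # text reported when the check fails


# Hypothetical requirements; real policies would come from the institution.
RULES: List[Rule] = [
    Rule("is_valid", lambda v: v == "True", "file is not valid JP2"),
    Rule("quality_layers", lambda v: int(v) >= 1, "too few quality layers"),
    Rule("decomposition_levels", lambda v: int(v) >= 1, "too few levels"),
]


def validate(features: Dict[str, str], rules: List[Rule] = RULES) -> List[str]:
    """Return one violation message per failed or missing property."""
    violations = []
    for rule in rules:
        value = features.get(rule.prop)
        if value is None:
            violations.append(f"{rule.prop}: missing")
        elif not rule.check(value):
            violations.append(f"{rule.prop}: {rule.message} (got {value!r})")
    return violations


def validate_batch(batch: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map-style step: keep only the images that violate some rule."""
    failures = {}
    for image_id, features in batch.items():
        violations = validate(features)
        if violations:
            failures[image_id] = violations
    return failures
```

Because each image is validated independently of the others, a step like this can be distributed over many workers, with only the list of failures written back to the repository.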

Acknowledgments

We would like to thank the following for their invaluable help in discussing and proofreading this paper: Bjarne Andersen, Karen Williams, Kåre Fiedler Christiansen and especially Jette Junge. We would also like to thank Tom Gravgaard Christensen and Jens Henrik Leonard Jensen for creating the Hadoop clusters, Kim Teglgaard Christensen for log files and input on the Newspaper Digitisation Project, and last but certainly not least our thanks go to all our colleagues in the SCAPE project for the great discussions on large-scale challenges and solutions; specifically Frank Asseg from FIZ Karlsruhe, Hélder Silva from Keep Solutions and Peter May from British Library who kindly helped us with the references.

Author information

Corresponding author

Correspondence to Bolette Ammitzbøll Jurik.

Additional information

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

About this article

Cite this article

Jurik, B.A., Blekinge, A.A., Ferneke-Nielsen, R.B. et al. Bridging the gap between real world repositories and scalable preservation environments. Int J Digit Libr 16, 267–282 (2015). https://doi.org/10.1007/s00799-015-0152-4

Keywords

  • Digital preservation
  • Digital repository
  • Preservation action
  • Preservation policies
  • Scalability
  • Integration
  • File characterisation
  • JPEG 2000
  • Apache Hadoop