Distributed and Parallel Databases, Volume 36, Issue 1, pp 153–194

MetaStore: an adaptive metadata management framework for heterogeneous metadata models

  • Ajinkya Prabhune
  • Rainer Stotzka
  • Vaibhav Sakharkar
  • Jürgen Hesser
  • Michael Gertz
Part of the following topical collections:
  1. Special Issue on Large-Scale Data Curation and Metadata Management

Abstract

In this paper, we present MetaStore, a metadata management framework for scientific data repositories. Scientific experiments are generating a deluge of data, and handling the associated metadata is critical because it enables the discovery, analysis, reuse, and sharing of scientific data. Moreover, metadata produced by scientific experiments are heterogeneous and subject to frequent changes, demanding a flexible data model. Existing metadata management systems provide a broad range of features for handling scientific metadata. However, the principal limitation of these systems is that their architecture is restricted to a single or, at most, a few standard metadata models. These systems cannot handle different types of metadata models, i.e., administrative, descriptive, structural, and provenance metadata, together with community-specific metadata models. To address this challenge, we present MetaStore, an adaptive metadata management framework based on a NoSQL database and an RDF triple store. MetaStore provides a set of core functionalities to handle heterogeneous metadata models by automatically generating the necessary software code (services) and extending the functionality of the framework on the fly. To handle dynamic metadata and to control metadata quality, MetaStore also provides an extended set of functionalities: annotation of images and text through integration of the Web Annotation Data Model, discipline-specific vocabularies that communities define using the Simple Knowledge Organization System (SKOS), and advanced search and analytical capabilities through integration of Elasticsearch. To maximize the utilization of the data models supported by NoSQL databases, MetaStore automatically segregates the different categories of metadata into their corresponding data models. Complex provenance graphs and dynamic metadata are modeled and stored in an RDF triple store, whereas static metadata is stored in a NoSQL database. To enable large-scale harvesting (sharing) of metadata using the METS standard over the OAI-PMH protocol, MetaStore is designed to be OAI-compliant. Finally, to show the practical usability of the MetaStore framework and that the requirements of the research communities have been met, we describe our experience in adopting MetaStore for three communities.
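
As a rough illustration of what harvesting over OAI-PMH involves from the client side, the following minimal sketch builds a ListRecords request URL. The `verb`, `metadataPrefix`, `from`, and `set` arguments are defined by the OAI-PMH 2.0 protocol; the endpoint URL, the helper name `list_records_url`, and the `mets` prefix value are assumptions made for illustration and are not taken from MetaStore itself. A real repository advertises the prefixes it supports through the ListMetadataFormats verb.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the abstract does not give MetaStore's OAI-PMH URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="mets", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix="mets" is an assumed value; the actual prefix is
    whatever the target repository declares for its METS records.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Optional selective harvesting by datestamp (UTC, e.g. "2017-01-01").
        params["from"] = from_date
    if set_spec is not None:
        # Optional selective harvesting by set, e.g. one per community.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2017-01-01"))
```

A harvester would issue an HTTP GET on this URL and follow any resumption tokens in the response to page through large result sets.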

Keywords

MetaStore; NoSQL database; Automated code generation; Annotations

Notes

Acknowledgements

This research is supported by the Portfolio Extension of Helmholtz Association “Large Scale Data Management and Analysis” and DFG (German Research Foundation) MASi Project (STO 397/4-1).

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Ajinkya Prabhune (1)
  • Rainer Stotzka (1)
  • Vaibhav Sakharkar (1)
  • Jürgen Hesser (2)
  • Michael Gertz (3)

  1. Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
  2. Department of Radiation Oncology, Heidelberg University, Heidelberg, Germany
  3. Institute of Computer Science, Heidelberg University, Heidelberg, Germany
