Introduction

In 2016, Wilkinson et al. introduced the FAIR principles of data management: that research data should be Findable, Accessible, Interoperable and Reusable. To meet the requirements of the Findable aspect of FAIR data, a dataset must be described by rich metadata in a searchable resource and must be assigned a clearly labelled, persistent, unique identifier. The metadata describing the data resource should be released with a clear data usage license and detailed data provenance, and should meet domain-relevant community standards.

Online metadata catalogues for environmental data have been documented since early in the history of the World Wide Web (Günther et al., 1996). This early approach focused on helping users to discover, or find, data relevant to a given topic and to access it quickly in a user-friendly manner. It was also beneficial in meeting legislative requirements, such as the European Access to Information on the Environment Regulations (European Parliament, 2003). Even at this stage the semantic interoperability of metadata and data was recognised as important. As new paradigms of data have come to the fore, this issue has only grown in importance (Hilbring & Usländer, 2006; Proctor et al., 2010; Tanhua et al., 2019).

In the context of environmental Big Data, Vitolo et al. (2015) call for the use of data catalogues to allow the discovery of data services and their functionality. However, they point out that semantic heterogeneity is a hurdle which must be overcome when searching through catalogue services. Leadbetter et al. (2014) and Leadbetter & Vodden (2016) demonstrate how interoperable, homogeneous semantics can provide improved knowledge-building and cross-disciplinary data integration in environmental data catalogues.

One such cross-disciplinary activity is Marine Spatial Planning (MSP), which is concerned with managing the distribution of human activities in space and time in and around seas and oceans to achieve ecological, economic and societal objectives and outcomes (Ehler et al., 2019). Nylén et al. (2019) include the establishment of a metadata catalogue for the data to be used in the process as one of the steps in their MSP data workflow. Within their framework, the catalogue should be able to differentiate between the original versions of existing spatial data and newly created data products derived from one or more original datasets, and should record the processing steps taken to generate the new data products. The data catalogue should also be able to handle both observed and modelled data and, for modelled data, provide information on the input parameters to the model and the methods employed by the model. Flynn et al. (2019) conclude that a data cataloguing system for MSP allows the availability and suitability of data for the MSP process to be assessed at regular review cycles. Friddell et al. (2014) demonstrate that in other cross-disciplinary topics, in their case polar research, modularity is required in order to represent datasets, projects or programmes, and other polar data resources within the catalogue system.

Marine Spatial Planning is also a European legislative requirement (European Parliament, 2014), as are other data integration programmes including the Marine Strategy Framework Directive (European Parliament, 2008) and the INSPIRE Spatial Data Infrastructure. A data catalogue should recognise these targets and look to meet the technical requirements they set, as well as highlighting which datasets may be relevant to them. These include, for example, the delivery of ISO 19115/19139 standard metadata to comply with the INSPIRE Spatial Data Infrastructure (Craglia & Annoni, 2007). In addition to legislative requirements, community standards should also be adhered to, such as the European Directory of Marine Environmental Datasets (Schaap & Lowry, 2010) and the Marine Community Profile (Proctor et al., 2010).

Therefore, in the sphere of marine science data management, the need for a modular approach to data cataloguing designed to meet the requirements highlighted above (see Table 1) can be clearly seen. In this paper we describe a data cataloguing system developed at, and in use at, the Marine Institute, Ireland. We expand on the data model used in developing the catalogue; describe the approach taken to implementing it; and discuss our findings and future work.

Table 1 Functional requirements for a marine data cataloguing system

Data model

The data model used within this modular catalogue is focused on a number of high-level concepts and their inter-relationships, illustrated in Fig. 1. These concepts are developed as modular classes within the data model and are described below. Examples of instances of the classes are given in the text and are summarised in Table 2.

Fig. 1

A high-level overview of the data model used in the modular data catalogue approach. The overall class structure is shown in the Unified Modelling Language

Table 2 Examples of instances of the classes in the Data Catalogue data model

Dataset

First is the high-level Dataset class (Fig. 2). A Dataset may combine many different parameters, collected at multiple times and locations, using different instruments. A Dataset is linked to its storage and retention information and to the classification, including licensing, associated with the Dataset under a machine-actionable data policy. This machine-actionable data policy is derived from a set of business rules associated with the data classifications laid out in the institutional data policy (such as Marine Institute, 2017). Therefore, a Dataset which is marked as containing personal data, as defined by the European General Data Protection Regulation (Voigt & Von dem Bussche, 2017), or business-sensitive data will not be made publicly available. Examples of a Dataset include an institution’s entire research vessel Conductivity-Temperature-Depth profile archive; or a spatial dataset such as the distribution and abundance of cetacean species within an exclusive economic zone.
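As an illustration, a minimal sketch of how such a business rule might be made machine actionable follows; the classification labels, class and function names are hypothetical, not those of the Marine Institute's actual policy.

```python
from dataclasses import dataclass

# Hypothetical classification labels; in practice these come from the
# institutional data policy (e.g. Marine Institute, 2017).
RESTRICTED_CLASSIFICATIONS = {"personal-data", "business-sensitive"}

@dataclass
class DatasetRecord:
    title: str
    classification: str  # assigned under the institutional data policy
    licence: str

def publicly_releasable(record: DatasetRecord) -> bool:
    """Business rule: records classified as containing personal data
    (per the GDPR) or business-sensitive data are never made public."""
    return record.classification not in RESTRICTED_CLASSIFICATIONS

ctd_archive = DatasetRecord(
    title="Research vessel CTD profile archive",
    classification="open",
    licence="CC-BY-4.0")
assert publicly_releasable(ctd_archive)
```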

Fig. 2

A more detailed UML view of the Dataset, Dataset Collection Activity and Dataset Collection classes

Dataset Collection Activity

Related to a Dataset is a Dataset Collection Activity (Fig. 2). This class specialises the Dataset in that it has a mandatory end date and also a mandatory platform element, which indicates the vehicles, structures or organisms capable of bearing instruments or tools for the collection of physical, chemical, geological or biological samples or data. Examples of a Dataset Collection Activity include a research vessel survey or cruise; or the deployment of a moored buoy at a specific location for a given time period.
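A minimal sketch of how the mandatory elements of this class might be encoded is given below; the field names are illustrative rather than the catalogue's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetCollectionActivity:
    """Unlike a Dataset, which may be open-ended, a Dataset Collection
    Activity must carry an end date and a platform reference."""
    name: str
    start_date: date
    end_date: date   # mandatory for an Activity
    platform: str    # reference to a Platform instance

survey = DatasetCollectionActivity(
    name="Annual groundfish survey",   # illustrative
    start_date=date(2019, 10, 1),
    end_date=date(2019, 11, 15),
    platform="RV Celtic Explorer")
```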

Platform

Within the INSPIRE spatial data infrastructure, the Environmental Monitoring Facilities component describes an environmental monitoring facility (e.g. a research vessel or a satellite) as a spatial object, together with the observations and measurements linked to that facility (INSPIRE TWG EMF, 2013). The Platform class (Fig. 3) of this catalogue system seeks to carry the attributes required to complete an Environmental Monitoring Facilities instance when combined with details from the Dataset Collection Activity class. It is also synonymous with the GeoLink class Platform, which describes a “physical object of significance enabling observations resulting in a Dataset” (Krisnadhi et al., 2015). To this end a Platform instance is attributed with: its platform type; whether or not it is a mobile platform; which environmental regime it operates in; its operational start date and, if applicable, end date; and which Organisation is responsible for the platform. Where available, the International Council for the Exploration of the Sea platform code is also attributed to the Platform. Example instances of the Platform class include a research vessel, such as the RV Celtic Explorer, or an individual Argo programme drifting profiling float.
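Sketched as a data structure, under the assumption that the field names below map one-to-one onto the attributes described above (the dates and ICES code shown are purely illustrative):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Platform:
    name: str
    platform_type: str               # from a controlled vocabulary
    is_mobile: bool
    environmental_regime: str        # e.g. "marine"
    operational_start: date
    operational_end: Optional[date]  # None while still in service
    responsible_organisation: str
    ices_platform_code: Optional[str] = None  # where available

celtic_explorer = Platform(
    name="RV Celtic Explorer",
    platform_type="research vessel",
    is_mobile=True,
    environmental_regime="marine",
    operational_start=date(2003, 1, 1),   # illustrative date
    operational_end=None,
    responsible_organisation="Marine Institute",
    ices_platform_code="45CE")            # illustrative code
```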

Fig. 3

A more detailed UML view of the Geographic Feature, Organisation, Platform and Programme classes

Dataset Collection

The Dataset Collection class (Fig. 2) provides the link between a Dataset Collection Activity (e.g. a research vessel based survey; a deployment of a mooring) and a Dataset. As such, the Dataset Collection may be a subset of both the data collected by the Dataset Collection Activity (a limited set of the full parameters from that Activity) and the Dataset (possibly limited in time and/or parameter space). The Dataset Collection is linked to both a Dataset Collection Activity and a Dataset, and to the Device(s) used to sample the environment for a given range of parameters. An example of a Dataset Collection may be the Conductivity-Temperature-Depth profiles taken on a research vessel survey, allowing the individual sensors to be connected to the activity and the calibration of those sensors to be connected with the associated measurements. A further example could be the time series of atmospheric weather conditions recorded during the deployment of a sea-surface monitoring buoy, which allows the change of sensors at service intervals of the buoy to be properly tracked within the catalogue.
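A sketch of the linking role of this class, with hypothetical identifiers, might look as follows:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetCollection:
    """Links a Dataset Collection Activity to a Dataset and records the
    Device(s) used, so that sensor changes and calibrations remain
    traceable to the associated measurements."""
    dataset_id: str
    activity_id: str
    device_ids: List[str] = field(default_factory=list)
    parameter_codes: List[str] = field(default_factory=list)

ctd_profiles = DatasetCollection(
    dataset_id="national-ctd-archive",     # hypothetical identifiers
    activity_id="groundfish-survey-2019",
    device_ids=["ctd-serial-0423"],
    parameter_codes=["PSALCU01"])          # BODC P01 code (see Listing 1)
```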

Geographic Feature

A Geographic Feature (Fig. 3) is a mandatory attribute of a Dataset Collection Activity, and a recommended attribute of a Dataset. The Geographic Feature within this data catalogue model is closely related to the Open Geospatial Consortium and International Organisation for Standardisation’s Simple Feature Access model (Herring, 2011). To this end, the Geographic Feature class stores the geographic coordinates of points, lines, and polygons, and the feature type for both the Simple Feature Access model and the European Commission’s INSPIRE spatial data infrastructure. An instance of the Geographic Feature class may be attributed as a child of another Geographic Feature in order to build hierarchical networks of Geographic Features, such as river catchments and sea areas. Further attributes of a Feature within this model are the Coordinate Reference System used to define the latitude and longitude of the point, line or polygon; a URL to a definition of the Geographic Feature; and an organisation responsible for the Geographic Feature. Example instances of the class are a sampling location; a research vessel survey track; or a polygon defining a lake or river catchment area.
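As a sketch, the class might be represented as below, using the shapely library for Simple Feature Access geometries; the station name and coordinates are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

from shapely.geometry import Point  # Simple Feature Access geometries

@dataclass
class GeographicFeature:
    name: str
    geometry_wkt: str                # point, line or polygon as WKT
    feature_type: str                # Simple Feature Access / INSPIRE type
    crs: str                         # Coordinate Reference System
    definition_url: Optional[str] = None
    responsible_organisation: Optional[str] = None
    parent: Optional["GeographicFeature"] = None  # hierarchy support

# An illustrative sampling location, expressed in WGS 84 (EPSG:4326)
station = GeographicFeature(
    name="Weather buoy station",
    geometry_wkt=Point(-10.55, 51.22).wkt,   # "POINT (-10.55 51.22)"
    feature_type="Point",
    crs="EPSG:4326")
```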

Programme

The Programme class (Fig. 3) is similar in scope to the EarthCube GeoLink ontology’s Program class in that instances represent a “formally recognized scientific effort receiving significant funding, requiring large scale coordination” (Krisnadhi et al., 2015). An instance of the Programme class may have a coordinating organisation, a number of contributing and funding organisations, and the name of an individual who is the principal investigator of the Programme. A Programme is time-bound by a start date and an optional end date, and may have a URL link to a website describing the Programme. A Programme may have a number of deliverables associated with it. An instance of the Programme class may also be the child of another instance of the same class.

Device

As stated above, a Dataset Collection Activity takes place via a Platform and is linked to a Dataset through a Dataset Collection which describes the deployment of a Device on a Platform. The Device class (Fig. 4) is designed to allow a SensorML (Botts & Robin, 2007) record to be constructed for a given Device instance. As such, an instance of the Device class carries the input and output parameters of the Device, its measurement units, its manufacturer, its operating organisation, and its start and end dates. It also carries links to the documentation regarding the calibration history of the Device. The Device class is more detailed than the similar GeoLink class Instrument as it holds the Device's serial number as well as the instrument type from a controlled vocabulary.
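A sketch of the Device class as a data structure, from which a SensorML record could then be rendered; the field names are assumptions based on the description above, not the catalogue's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Device:
    """Carries the attributes needed to construct a SensorML record."""
    instrument_type: str   # from a controlled vocabulary
    serial_number: str     # distinguishes Device from GeoLink's Instrument
    manufacturer: str
    operating_organisation: str
    start_date: date
    end_date: Optional[date] = None
    input_parameters: List[str] = field(default_factory=list)
    output_parameters: List[str] = field(default_factory=list)
    measurement_units: List[str] = field(default_factory=list)
    calibration_documents: List[str] = field(default_factory=list)  # URLs
```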

Fig. 4

A more detailed UML view of the Device class

Organisation

The Organisation class (Fig. 3) is designed to capture the details of research institutes, data holding centres, monitoring agencies, and governmental and private organisations that are in one way or another engaged in oceanographic and marine research activities, data and information management, and/or data acquisition activities. It is synonymous with the GeoLink Organisation class, but is more detailed in its attribution. Attributes include the full postal address of the organisation and institutional contact details (email, telephone, fax number, web site), which are used instead of personal contact details in any publicly available metadata in order to comply with the European General Data Protection Regulation. A link to the page from which the information was collected is maintained. Where an organisation has an entry in the European Directory of Marine Organisations (Schaap & Lowry, 2010), the unique identifier from that directory is also assigned to the Organisation record here.

Re-use of community-managed controlled vocabulary terms

Many attributes of the classes in the data model are constrained against well-managed, community-governed controlled vocabularies, which addresses one of the Interoperability aspects of the FAIR principles. These are highlighted in Table 3. Controlled vocabularies provide consistency in the labelling of metadata and, when published online, allow for interoperability through accessing labels and definitions through web services (Schaap & Lowry, 2010). Controlled vocabularies which publish a hierarchy of terms, that is a “thesaurus” (McGuinness, 2002), allow the coarse-grained terminology often used as a data discovery vector to be inferred from the fine-grained terminology which is important in usage metadata (see Fig. 5). Rather than storing a local copy of the full hierarchy of the vocabulary terms, the data catalogue solution presented here tags its entities only with the finest-grained vocabulary terms; when coarser-grained terms need to be attributed to a dataset for discovery purposes, these are inferred from queries to web services at the vocabulary service host organisations. Listing 1 shows an example query in SPARQL (the query language for semantic databases) which builds up the hierarchy, illustrated in Fig. 5, for a parameter usage vocabulary term.

Table 3 The use of community-governed controlled vocabularies to constrain various properties within the data model
Fig. 5

Hierarchy of inferred vocabulary terms as a result of tagging a Dataset with a term from the British Oceanographic Data Centre (BODC) Parameter Usage Vocabulary. The codes in brackets - e.g. P01, P02 - indicate the collection identifier from the NERC Vocabulary Server, such that http://vocab.nerc.ac.uk/collection/P01/current/ returns the BODC Parameter Usage Vocabulary. MSFD indicates the European Commission’s Marine Strategy Framework Directive; ISO indicates the International Organisation for Standardisation; GEMET indicates the European Environment Agency’s General Multilingual Environmental Thesaurus; and INSPIRE is the European Commission’s Spatial Data Infrastructure

Listing 1

An example SPARQL query to be issued against the NERC Vocabulary Server to build the hierarchy shown in Fig. 5 for the code which represents “Practical salinity of the water body by CTD and computation using UNESCO 1983 algorithm and NO calibration against independent measurements” with the URL http://vocab.nerc.ac.uk/collection/P01/current/PSALCU01/
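A minimal sketch of a query of this kind, issued from Python, is shown below; the SPARQL endpoint address and the reliance on SKOS broader mappings are assumptions to be verified against the NERC Vocabulary Server documentation.

```python
import requests

# Assumed endpoint; check the NERC Vocabulary Server documentation.
ENDPOINT = "https://vocab.nerc.ac.uk/sparql/sparql"

# Walk the SKOS "broader" relationships upward from the fine-grained
# P01 term to recover the coarser discovery-level terms (cf. Fig. 5).
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?broader ?label WHERE {
  <http://vocab.nerc.ac.uk/collection/P01/current/PSALCU01/>
      skos:broader+ ?broader .
  ?broader skos:prefLabel ?label .
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"})
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["broader"]["value"], "-", row["label"]["value"])
```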

Implementation

In order to implement the data model described above, the architecture described below and illustrated in Fig. 6 has been adopted.

Fig. 6

A high-level view of the adopted system architecture. The component numbers are identified in the main body of the text

The first component (component 1 in Fig. 6) is an internal repository of metadata, developed using the Drupal content management system. Drupal is an open-source, community-based framework enabling rapid development of web applications and is particularly suited to content management systems such as the Data Catalogue. The flexible native content management ability of a framework such as Drupal was key to the decision to capture metadata in it rather than in a more familiar data cataloguing platform such as CKAN or GeoNetwork. In addition to core Drupal functionality provided ‘out-of-the-box’, the Data Catalogue also makes use of extended functionality through the inclusion of contributed software modules which can be managed within the Drupal framework. It is also possible to develop new modules to provide custom functionality that may not be available as core or contributed modules. It should be noted that:

  • This repository is designed as an internal intranet portal only and not for general public access. A subset of relevant and appropriately classified data descriptions, as defined by the machine-actionable data policy, is shared externally, and only after the criteria for external publication have been met.

  • The Data Catalogue implements role-based access control, allowing user access to be appropriately managed, e.g. limiting create/update privileges to data owners and administrators.

  • The Data Catalogue is available in read-only mode to any users already authenticated on the internal network. In this case a restricted view is provided, ensuring that any restricted-access information is hidden.

The Data Catalogue has been developed to export metadata for datasets and services in ISO 19115/19139 based XML format in compliance with the INSPIRE implementing rules for metadata (component 2 of Fig. 6). This allows dataset descriptions and associated information (e.g. owning organisation, programme, sensor information etc.) to be published and/or harvested from the Data Catalogue using industry standard formats and metadata rules. In addition, the internal Catalogue supports the DataCite metadata schema, allowing a completed data description entry to be exported in support of the minting of Digital Object Identifiers (DOIs) for published data. The assignment of a DOI to a dataset is a well-documented paradigm which allows data to be cited within the scientific literature, and allows a data centre or data publishing organisation to assert that an assessment of the technical quality (metadata, data format) of the dataset has been passed and that the data will be maintained and served for the foreseeable future (Callaghan et al., 2012). The dataset citation is the only place in the Data Catalogue system where individuals’ names are made publicly available alongside the dataset, as dataset authors. As can be seen in Fig. 2, it is not the only place where individuals’ names are stored in the Data Catalogue system; in these other occurrences only organisational-level contact information (such as an info@example.org email address) and contact points based on an individual’s organisational affiliation are made available to public view. The operational procedure established around this is to obtain explicit consent from all dataset authors prior to the publication step, in order to comply with the European General Data Protection Regulation. When a DOI is assigned to an entity in the Data Catalogue, it is recommended best practice to create and store the shortened form of the DOI from the ShortDOI.org service at the time the DOI is minted.
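A minimal sketch of retrieving that shortened form, assuming the JSON interface documented at ShortDOI.org and using a well-known DOI purely for illustration:

```python
import requests

def short_doi(doi: str) -> str:
    """Ask the ShortDOI.org service for the shortened form of a DOI.
    Assumes the service's documented JSON response, which includes a
    'ShortDOI' key."""
    response = requests.get(f"https://shortdoi.org/{doi}",
                            params={"format": "json"})
    response.raise_for_status()
    return response.json()["ShortDOI"]

# Illustrative call with the DOI of the DOI Handbook:
# print(short_doi("10.1000/182"))
```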

A subset of the content maintained within the internal Data Catalogue is shared externally. This publication process makes use of the standard metadata export functionality (XML formatted files) and the external-facing GeoNetwork instance (component 3 of Fig. 6). GeoNetwork is an open-source catalogue application for managing spatially referenced resources; it provides powerful metadata editing and search functions as well as an interactive web map viewer, and is currently used in numerous Spatial Data Infrastructure initiatives worldwide. A custom implementation of GeoNetwork has been developed to serve as the external, public-facing web portal for the Data Catalogue. A number of steps are involved in the publication process, which are described below and illustrated in Fig. 7. Content is regularly exported from the internal data catalogue in ISO 19139 XML. This process can be configured to run as a background task or be manually initiated if updates are required immediately. Publication criteria and rules are applied through the machine-actionable data policy to ensure that only content appropriate for publication is included in the export process. These rules are based on data classification, publication status, licensing etc., and can be updated as required. Once exported, metadata XML files are moved to a central staging area located on the external perimeter network or ‘demilitarized zone’ (DMZ). This serves as the collection point for the GeoNetwork instance. The GeoNetwork instance includes an automated and configurable harvest capability, allowing the previously exported data descriptions to be imported and published on the public-facing portal.
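The export step might be sketched as follows; the function and path names are hypothetical, and the policy check and XML rendering are parameters standing in for the production implementations.

```python
from pathlib import Path

# Hypothetical DMZ staging directory harvested by GeoNetwork.
STAGING_AREA = Path("/dmz/geonetwork-staging")

def export_for_publication(records, is_publishable, to_iso19139_xml):
    """records: iterable of catalogue entries (dicts with an 'id' key).
    is_publishable: the machine-actionable policy check (classification,
    publication status, licensing).
    to_iso19139_xml: callable rendering a record as ISO 19139 XML."""
    for record in records:
        if not is_publishable(record):
            continue  # policy rules keep restricted content internal
        xml = to_iso19139_xml(record)
        (STAGING_AREA / f"{record['id']}.xml").write_text(xml)
```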

Fig. 7

The metadata publication process from internal data catalogue to external GeoNetwork instance. The Data Steward and Data Coordinator Roles are described in Leadbetter et al. (2019)

While not a core component of the Data Catalogue, the external-facing Catalogue, through GeoNetwork, supports integration with other data serving applications (components 4 and 5 of Fig. 6). This allows Data Catalogue users to download, or link to, the underlying data as described in the Data Catalogue. For spatial data this is achieved via Open Geospatial Consortium compliant web services from a GeoServer instance. GeoServer implements a number of standards such as Web Feature Services, Web Map Services, and Web Coverage Services. Another important data serving application is ERDDAP, a data server that gives a simple, consistent way to download subsets of scientific datasets in common file formats and make graphs and maps (Simons, 2019). ERDDAP has been developed by the National Oceanic and Atmospheric Administration in the United States to provide access to data stored in multiple different formats through a web interface and through RESTful URLs as a web service, brokering the storage formats to a number of data delivery formats. ERDDAP is a useful tool in scientific data delivery, not just for marine science, as it can access and serve any tabular or gridded data. These data integration components are included here to provide a complete, unified solution view of metadata and data delivery.
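The flavour of ERDDAP's RESTful interface is sketched below; the server address, dataset identifier and variable names are hypothetical, while the URL pattern follows ERDDAP's documented tabledap convention.

```python
# ERDDAP tabledap request URL:
#   /erddap/tabledap/<datasetID>.<fileType>?<variables>&<constraints>
base = "https://erddap.example.org/erddap/tabledap"   # hypothetical server
dataset_id = "wave_buoy_observations"                 # hypothetical dataset
query = ("time,latitude,longitude,sea_surface_temperature"
         "&time>=2019-01-01T00:00:00Z")
url = f"{base}/{dataset_id}.csv?{query}"
print(url)  # requests a CSV subset of the dataset, filtered by time
```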

This approach provides a clear decoupling and separation of potentially sensitive internal data descriptions and externally published metadata. Only information explicitly categorised as suitable for publication is persisted on the external system. The external-facing portal publishes a ‘read-only’ copy of the metadata contained within the Data Catalogue, which mitigates data loss should the external system be compromised. Core user accounts and system provisioning details are maintained on the internal Drupal system, residing on a controlled and secure network. The publication criteria can be updated and modified at any time if requirements or user needs change. The approach makes use of industry standard metadata and provides an excellent reference for other implementers that may be interested in using metadata in a similar way, for example Ireland’s Open Data Portal (https://data.gov.ie/).

Conclusions

We have presented a reusable, modular approach to cataloguing marine science data which meets a number of functional requirements derived from both academic literature and legislative drivers. The Data Catalogue system presented above also meets, at a base level, the requirements of the FAIR principles of data management (see Table 4). One particular development of note in the “Findability” principle is that the data model is presented within the HTML representations of the metadata landing pages using JSON-LD encoded Schema.org (see Listing 2). This improves the discoverability of the content of the Data Catalogue through exposing it to tools such as Google’s Dataset Search.

Table 4 How the data cataloguing platform described in this paper addresses the requirements of the FAIR principles of Data Management
Listing 2

A Schema.org representation of a dataset from within this data catalogue model. The original record is available at http://data.marine.ie/geonetwork/srv/eng/catalog.search#/metadata/ie.marine.data:dataset.2752
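A skeletal sketch of a Schema.org Dataset description of this kind, generated from Python, is shown below; all values are illustrative placeholders rather than the published record.

```python
import json

dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example marine dataset",                 # placeholder values
    "description": "Illustrative dataset description.",
    "url": "https://data.example.org/dataset/1234",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "identifier": "https://doi.org/10.0000/example",  # hypothetical DOI
}

# Embedded in the HTML landing page for harvesting by dataset search tools:
html_snippet = ('<script type="application/ld+json">\n'
                + json.dumps(dataset_jsonld, indent=2)
                + "\n</script>")
print(html_snippet)
```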

Although Table 4 shows a good alignment of the work presented above with the FAIR principles, work remains to complete the formalised representation of the data in structured formats beyond Schema.org and the provenance of the datasets described in the Data Catalogue. Firstly, although GeoNetwork supports a generic Resource Description Framework (Miller, 1998) description of metadata records using the Data Catalog vocabulary (Maali et al., 2014), this requires extension to add specific terms from domain-specific ontologies such as GeoLink. This should also allow for more formalised descriptions of linkages between various datasets, using richer semantics to describe the connections. Better connectivity between datasets and the reports which use them is also required in the future. A further semantic application would be the use of spatial semantics to provide textual geographic search, which requires extensions to the existing structured thesauri describing geographic regions of the sea, such as the SeaVoX salt and freshwater body gazetteer (http://vocab.nerc.ac.uk/collection/C19).
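A minimal sketch of the kind of DCAT description extended with a domain ontology term, using the rdflib library; the GeoLink namespace URI is an assumption and the dataset URI is a placeholder.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")
GEOLINK = Namespace("http://schema.geolink.org/1.0/base/main#")  # assumed URI

g = Graph()
ds = URIRef("https://data.example.org/dataset/1234")   # placeholder URI
g.add((ds, RDF.type, DCAT.Dataset))                    # generic DCAT typing
g.add((ds, DCTERMS.title, Literal("Example marine dataset")))
g.add((ds, RDF.type, GEOLINK.Dataset))   # domain-specific extension term
print(g.serialize(format="turtle"))
```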

Marine science programmes often collect biological samples in combination with environmental data. A collection of physical samples is analogous to a Dataset, with added complexity due to the samples’ tangibility. For example, a Dataset Collection Activity in the form of a marine research vessel survey may have a primary goal to measure stock abundance of a specific fishery (e.g. haddock or cod). In this example, a large part of the survey will involve taking biological samples (in the form of fish otoliths) for ageing the population to report on stock recruitment. The resulting age dataset will be used to inform policy advice on regulatory measures regarding fishing effort in succeeding years (Marine Institute, 2018). The biological samples (in this case, the otoliths), and the associated fish metadata, are often stored for an extended period of time after the Dataset Collection Activity, for scientific reproducibility and transparency of the age dataset generated. In addition, otoliths can be used for microchemical analyses to investigate fish diet and habitat (Campana & Thorrold, 2001), which can be valuable for fisheries conservation efforts in subsequent years. Therefore, the necessity for appropriate physical and digital storage of biological samples and their associated metadata is evident. We anticipate the development of an optional accessory extension to the Data Catalogue to model biological samples and their associated metadata. The extension will utilise selected concepts from the Data Catalogue, such as Geographic Feature and Programme, but also include additional metadata. For example, in the fisheries use-case, phenotype data (e.g. fish length and weight) will be associated with each biological sample (i.e. each otolith). We expect the physical samples extension to the Data Catalogue to become a useful tool for the long-term archiving and reusability of physical samples resulting from various marine science programmes.

Finally, there is ongoing work in the data catalogue beyond the FAIR principles, as these offer only a base level of good data stewardship (Boeckhout et al., 2018). One example is the automation of assessments of the maturity of the stewardship of datasets within the Data Catalogue system. This takes the Data Stewardship Maturity Framework of Peng et al. (2015) as its starting point and will assess the values encoded for various elements in the Data Catalogue’s data model to produce a rating for a given dataset. As discussed by Flynn et al. (2019), this approach can also be specialised to provide an assessment of the suitability of a dataset for a given application, in the case of their study Marine Spatial Planning.