The various elements of the implementation pack are outlined in Table 3 and are expanded in the sections below.
Data management plan
A Data Management Plan (DMP) describes what data will be created during a project, how they will be stored during the project, how they will be archived at the end of the project and how access will be granted (where appropriate). Although a DMP should be prepared before a project begins, it must be referred to and reviewed throughout, as well as after the project, so that it remains relevant.
In accordance with the Marine Institute’s Data Policy (Marine Institute 2017) and in keeping with the Government’s Open Data Policy (Government Reform Unit 2017), “...data will by default be made available for reuse unless restricted...”. Most data generated during a project can be successfully archived and shared. However, some data are more sensitive than others. A DMP helps to identify issues related to confidentiality, ethics, security and copyright, which should be considered before a project is initiated. Any challenges to data sharing (e.g. data confidentiality) should be critically considered in the plan, with solutions proposed to optimise data sharing.
Directorate General for Research & Innovation (2016) states that DMPs are a “key element of good data management”.
Funding bodies do not usually ask for a lengthy plan; in fact, in 2011 the US National Science Foundation (NSF) policy stated that all NSF proposals must have a data management plan of no more than two pages (National Science Foundation 2011). The UK’s Natural Environment Research Council (NERC) proposed a short ‘Outline Data Management Plan’ (ODMP) (Natural Environment Research Council 2019), with the view that a full data management plan is completed by the Principal Investigator (PI) within three months of the start date of a grant. The main purpose of an ODMP is to identify whether a project will in fact produce data and, if so, the estimated quantity of those data.
Under H2020 the Commission provides a DMP template (Directorate General for Research & Innovation 2018), the use of which is voluntary; however, the submission of a first version of a DMP is considered a deliverable within the first 6 months of the project. H2020 FAIR stipulates that a DMP should be submitted only as part of the ORD (Open Research Data) pilot; all other proposals are encouraged to submit a DMP but, at the very least, are expected to address good research data management under the impact criterion, addressing specific issues. DMPs (under the ORD pilot) should include information on:
Data Management - during & after the project
What data the plan covers
Methodologies & Standards
Data Accessibility – Sharing / Open Access
Data Curation & Preservation
The good research data management criterion should address:
Standards to be applied
Data accessibility for verification & reuse including reasons why the data cannot be made available, if applicable
Data Curation & Preservation methods
In general a DMP should contain the following elements to ensure the data will be managed to the highest standards throughout the project data lifecycle in keeping with the Marine Institute’s Data Policy principles around the management of data. These elements include (but are not limited to):
Project & Data Description
Data Retention & Preservation
Data Reuse (Sharing and Publication)
There is evidence to suggest that, when preparing a DMP, a generic template with commentary is useful in guiding a user to address the appropriate considerations. This can simply take the form of a checklist of questions in a document; alternatively, electronic tools are available, such as the Digital Curation Centre’s (DCC) DMPOnline, to help navigate a user through the appropriate sections. As part of the Data Management Quality Management Framework (DM-QMF) Implementation Pack, a Word template has been created which utilises the DCC checklist. This has been piloted for several in-house Marine Institute data processes, receiving very positive responses.
Data management costs relating to the preparation of data for deposit and ingestion, data storage, and ongoing digital preservation and curation after the project can be included in a data management plan. Forward planning can help to illustrate, and achieve, time savings in accessing the data by avoiding the costly task of recreating data that have been lost or corrupted.
The UK Data Archive has developed a Costing Tool that can be used for costing data management in the social sciences. This is based on each activity (e.g. in the data management checklist) that is required to make research data shareable beyond the primary research team. It can be used to help prepare research grant applications.
The requirements document should contain an agreed set of clear requirements for the data being produced. It is not intended to be an exhaustive list of requirements, but rather a high-level set of functional requirements that the process must achieve. These requirements may be either prerequisites that must be met in order to commence a data process, or requirements to be met in the design or output of a data process. For example, for the Marine Institute’s process to publish data through an instance of an ERDDAP server (Simons 2017), the prerequisites include that the dataset must have a public-facing record in the Marine Institute’s data catalogue, and that the dataset must not contain personal, sensitive personal, confidential, or otherwise restricted data. In addition, the criteria for successfully meeting these requirements should be specified, ensuring that the data produced meet the needs of consumers of the data.
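As an illustration only, the two ERDDAP publication prerequisites named above could be expressed as a simple check. This is a minimal sketch: the `DatasetStatus` class, its field names and the `unmet_prerequisites` function are hypothetical and are not part of the Implementation Pack.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetStatus:
    """Hypothetical summary of a dataset's state (field names are illustrative)."""
    has_public_catalogue_record: bool
    contains_restricted_data: bool  # personal, sensitive personal, confidential, etc.


def unmet_prerequisites(status: DatasetStatus) -> List[str]:
    """Return the prerequisites not yet met; an empty list means none are outstanding."""
    unmet = []
    if not status.has_public_catalogue_record:
        unmet.append("public-facing data catalogue record")
    if status.contains_restricted_data:
        unmet.append("free of personal, confidential or otherwise restricted data")
    return unmet


ready = DatasetStatus(has_public_catalogue_record=True, contains_restricted_data=False)
blocked = DatasetStatus(has_public_catalogue_record=False, contains_restricted_data=True)
```

Returning the list of unmet items, rather than a single true or false value, mirrors the intent of a requirements document: each failure can be traced back to a named requirement.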
A Process Flow is a visual representation of an activity or series of activities, using standard business notation, illustrating the relationship between major components and demonstrating the logical sequence of events. A Process Flow describes ‘the what’ of an activity and a Procedure describes ‘the how’. Together they form part of a Data Management Framework. The Process Flow may be split across multiple levels, but at the highest level should encompass the complete lifecycle of the data process (see Fig. 2). Process flow mapping involves gathering everyone involved in the process (administrators, contractors, scientists) together and determining what makes that process happen: inputs, outputs, steps and process time. A process map takes that information and represents it visually.
The visual aspect is key, but the benefits go beyond making the process easier to understand. Having every key team member aware and included improves morale by providing a visual representation of what everyone is working towards. Where problems are obvious, team members have a part in creating the solution. To ensure consistency across a suite of process flows, they are drawn using the Business Process Model and Notation (Object Management Group 2011).
All parties can discover exactly how the process happens, not how it is supposed to happen. In creating the process flow, discrepancies between the ideal and the reality can be clearly observed. Once a process is mapped, it can be examined for non-value-added steps. Unnecessary repetitions or time-wasting side-tracks can be clearly identified and then removed or altered as needed.
A complete Process Flow can provide a clear vision of the future. After pinpointing problems and proposing solutions, there is an opportunity to re-map the process to what it should be. This shares the big picture with a team; each contributing member is then able to carry out improvements with a shared vision in mind. An example process flow for an ocean modelling dataset is shown in Fig. 4.
A process flow highlights duplicate processes across an organisation as well as variant practices, allowing an organisation to prune out the inefficient and propagate the most effective.
Each process flow is supplemented with an accompanying Process Flow Data Sheet (part of the Implementation Pack), which provides context to each process. Moreover, the Process Flow Data Sheet allows process owners and data stewards to record information at the level of an individual process flow.
From a user perspective, the Process Flow Data Sheet is structured as a series of questions; mandatory information is clearly indicated, while optional information can be recorded as ‘N/A’ if deemed not applicable for a given process. Operational planning and control ensures that each process:
Defines and responds to the requirements for the data product or service.
Defines the acceptance criteria for the process output to ensure that requirements are met.
Is fully documented through each step, providing traceability and confidence that each planned activity has been performed.
Is modified only when changes are planned and reviewed to understand the impact when made operational.
Standard operating procedures
A Standard Operating Procedure is a set of step-by-step instructions compiled to perform the activities described by the Process Flow. Depending on complexity there can be multiple Standard Operating Procedures contained in a single Process Flow. Within the context of this implementation pack, the Process Flow and the Standard Operating Procedure are the main mechanisms used to capture and retain organisational knowledge. A template for Standard Operating Procedures has been developed as part of the Implementation Pack, and covers:
Purpose and scope of the procedure
Abbreviations and terminology used in the procedure
Roles and responsibilities required to carry out the procedure
Detailed description of the procedure
Data access and security
Data quality control
Data backup and archive
Reporting requirements (including legislative requirements on data delivery)
Recommendations to improve the procedure
The documentation is then stored in a document management system, allowing for version control. Figure 3 shows an example of a Standard Operating Procedure written in Markdown and stored in a private GitHub repository. Where appropriate, the Standard Operating Procedures are being migrated from plain documentation to automated workflows, such as Jupyter notebooks, to demonstrate reproducibility in the data processing workflow (Fig. 4).
Data catalogue entries
The Marine Institute’s data catalogue consists of an internal content management system and a public facing, standards compliant catalogue service. This decoupling of content and service is important as it allows a full data catalogue to be maintained inside the corporate firewall, with only those datasets which are deemed appropriate for public consumption published to the wider community. Within the content management system, this differentiation of datasets is achieved through an actionable version of the Marine Institute data policy. The logic applied by this actionable data policy ensures that non-open categories of datasets remain in the internal catalogue only.
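The effect of an actionable data policy can be sketched as a filter applied when records move from the internal catalogue to the public one. The category names, the `policy_category` key and the function name below are assumptions made for illustration; the actual categories of the Marine Institute data policy are not listed here.

```python
# Hypothetical policy categories considered safe to publish (assumption).
OPEN_CATEGORIES = {"open"}


def select_for_public_catalogue(records):
    """Keep only records whose policy category permits public release.

    Records with a missing or unrecognised category are treated as non-open,
    so the default is to keep a dataset internal.
    """
    return [r for r in records if r.get("policy_category") in OPEN_CATEGORIES]


internal_catalogue = [
    {"id": "ds-001", "policy_category": "open"},
    {"id": "ds-002", "policy_category": "restricted"},
    {"id": "ds-003"},  # no category recorded: stays internal
]
public_catalogue = select_for_public_catalogue(internal_catalogue)
```

Using an allow-list rather than a block-list is a deliberate choice in this sketch: a dataset is only published when it is positively identified as open.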
The internal content management system manages metadata related to datasets, dataset collection activities, organisations, platforms and geographic features. In this context a dataset may be comprised of the data from one or more collection activities, or may be a geospatial data layer, or may be non-spatial data that is logically grouped. A dataset optionally has a start and end time and an associated geographic feature. A dataset collection activity is, for example, a research vessel cruise or survey; or the deployment of a mooring at a site. A dataset collection activity must have a start date, an end date, and must be associated with both a geographic feature and a platform (such as a research vessel). A dataset collection activity is also linked to an associated dataset. The concept of geographic feature here links a dataset or dataset collection activity to a representation of the spatial coverage of the dataset. At the coarsest level of detail this will be a bounding box of the extent of the dataset, but a finer level of detail is recommended such as a representation of the shape of a research vessel survey track.
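The relationships described above can be sketched as a small data model. This is an illustrative sketch only: the class and field names are hypothetical and do not reflect the actual schema of the content management system, and the check that an end date does not precede a start date is an added assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in decimal degrees


@dataclass
class GeographicFeature:
    points: List[Point]  # e.g. vertices of a research vessel survey track

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Coarsest representation: (min_lon, min_lat, max_lon, max_lat).

        Naive min/max; does not handle tracks crossing the antimeridian.
        """
        lons = [p[0] for p in self.points]
        lats = [p[1] for p in self.points]
        return (min(lons), min(lats), max(lons), max(lats))


@dataclass
class CollectionActivity:
    # All fields are required, mirroring the mandatory metadata in the text.
    name: str
    start: date
    end: date
    platform: str
    feature: GeographicFeature

    def __post_init__(self):
        # Assumption: an activity cannot end before it starts.
        if self.end < self.start:
            raise ValueError("end date precedes start date")


@dataclass
class Dataset:
    title: str
    activities: List[CollectionActivity] = field(default_factory=list)
    start: Optional[date] = None  # optional for a dataset
    end: Optional[date] = None
    feature: Optional[GeographicFeature] = None


track = GeographicFeature([(-9.5, 53.1), (-10.2, 53.4), (-9.9, 52.8)])
cruise = CollectionActivity("Example survey", date(2020, 5, 1), date(2020, 5, 14),
                            "Example research vessel", track)
dataset = Dataset("Example survey data", activities=[cruise])
```

Storing the track itself and deriving the bounding box from it keeps the finer level of detail recommended above, while still allowing the coarse extent to be produced on demand.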
In order to ensure that the metadata in the data catalogue has a level of consistency and interoperability, a number of controlled vocabularies are used and referenced. These may be domain specific, as in those used by the SeaDataNet community (Schaap and Lowry 2010; Leadbetter et al. 2014), or more generalised, as in the ISO topic categories or those of the INSPIRE Spatial Data Infrastructure.
The internal content management system has functionality to export ISO 19115 metadata, encoded as ISO 19139 XML, to the public-facing catalogue server software. In turn, and aligned with the requirements of the European Commission’s INSPIRE Spatial Data Infrastructure, the catalogue server is compliant with the Open Geospatial Consortium’s Catalog Service for the Web standard. The content management system also exposes INSPIRE-compliant Atom feeds for data download services and Schema.org-encoded dataset descriptions to enhance the findability of datasets.
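To give a sense of what a Schema.org-encoded dataset description looks like, the following minimal sketch builds a JSON-LD document from a catalogue record. The function name, its parameters and the example.org URLs are hypothetical; only the Schema.org terms (`Dataset`, `DataDownload`, `name`, `description`, `url`, `distribution`, `contentUrl`) are taken from the Schema.org vocabulary.

```python
import json


def schema_org_dataset(title, description, landing_page, download_url=None):
    """Build a minimal Schema.org Dataset description as a JSON-LD dictionary."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "url": landing_page,
    }
    if download_url:
        # Only datasets with a download service advertise a distribution.
        doc["distribution"] = {"@type": "DataDownload", "contentUrl": download_url}
    return doc


record = schema_org_dataset(
    "Example ocean model output",
    "Illustrative record; not a real dataset.",
    "https://example.org/catalogue/ds-001",
    download_url="https://example.org/download/ds-001.nc",
)
json_ld = json.dumps(record, indent=2)
```

A description of this kind is typically embedded in the dataset landing page inside a `script` element of type `application/ld+json`, where it can be read by web search engines.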
Digital object identifiers for datasets
Following the guidance laid out in Leadbetter et al. (2013), digital object identifiers (DOIs) may be assigned to datasets in the Marine Institute data catalogue under certain circumstances. DOIs may only be applied to datasets which are in the public-facing data catalogue; therefore, in this system, datasets in any non-open category may not receive a DOI. Further, for a dataset in the data catalogue to receive a DOI, it must have a publicly accessible download of the dataset associated with the data catalogue record. This is not the case for all datasets in the data catalogue, as many do not have an associated data publication service; for these, the data catalogue record is used only as a discovery tool to highlight the existence of the dataset to potential users. The internal content management system allows the creation of DataCite metadata records for the minting of DOIs from the same database as is used to generate the ISO 19115 metadata records.
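The two eligibility conditions stated above can be summarised in a short check. This is a sketch under stated assumptions: the function name and its boolean parameters are hypothetical, and any further editorial criteria applied in practice are not modelled.

```python
def may_receive_doi(in_public_catalogue: bool, has_public_download: bool) -> bool:
    """Both conditions from the text must hold before a DOI may be minted."""
    return in_public_catalogue and has_public_download


# A discovery-only record: publicly listed, but with no download service.
discovery_only = may_receive_doi(in_public_catalogue=True, has_public_download=False)

# A publicly listed dataset with an accessible download.
published = may_receive_doi(in_public_catalogue=True, has_public_download=True)
```

Under these conditions a discovery-only record remains findable in the catalogue but is not citable through a DOI.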
Within the Implementation Pack, the Performance Evaluation, Lessons Learned, and Feedback sections are designed to provide inputs to improve individual data processes and the quality management framework as a whole. It should be noted that the questions asked here are tightly coupled with the quality objectives laid out in Table 2 and therefore the specifics may need some adjustment for use in other organisations. Bearing this in mind, the performance evaluation asks one or more questions against each of the performance objectives of the DM-QMF (see Table 4).
A reviewers’ checklist template has also been developed, and is completed by reviewers during the review process (see Table 5). This allows for a common score to be applied to the process to assess the level of maturity of the process and to highlight areas for improvement.