In the early days of publishing academic data, all data collections received by the UK Data Archive were processed, documented, and prepared for reuse in-house. This activity can be prohibitively expensive. An analysis of the long-term costs of digital preservation for research data across eleven UK and two European data archives showed that the costs of acquisition, ingest, and access activities far outweigh the cost of archival storage and preservation. For the UK Data Archive, the cost of ingest (preparing and processing data sets for ingest into the archive) represents about 20 % of the total archive cost and is the most expensive step in the archiving process [25]. This meant that the number of data sets that could be curated, archived, and published each year was limited, with a selection made from the data collections on offer (Fig. 2). With increasing research funding in the social sciences, there was a desire by the ESRC to see all data resulting from research grants equally and fairly archived and available for reuse. Technical advances also make it easier for researchers themselves to undertake data publishing activities. In addition, the original data creator (the researcher) has a better understanding of the research data, so, while it is still time-consuming to properly format and prepare data and add metadata, the data creator can accomplish these tasks in less time than would be required by a data archive curator who does not know the data in depth. Consequently, about a decade ago the archive started investing more in proactively guiding, training, and supporting researchers in good data management practices and skills for creating shareable data, as well as developing a self-publishing data repository system with prescriptive guidance and instructions so researchers can curate and publish data to the established archival standards. The repository system uses a DDI-compliant metadata profile aligned with the archive profile.
The result of this concerted activity under the banner of research data management services is a collection of best practice guides, handbooks, and accompanying teaching materials on relevant research data management topics (Table 3), following the logic of the data lifecycle [26–28]. This is complemented by extensive online guidance on the UK Data Service website and a programme of regular training workshops ranging from short introductory webinars or 2-hour face-to-face sessions, to advanced 2-day hands-on courses, for diverse audiences of doctoral students, senior researchers, research support staff, and research managers. The guidance includes various examples and exercises developed from real data collections, as well as templates researchers can use, such as a template consent form that takes data sharing into consideration, a transcription template for transcribing interviews, and a datalist template for collections of qualitative data items.
Table 3 Topics of research data management guidance for researchers in the social sciences
In early 2014, the newly developed ReShare self-deposit data repository [29], an extensively customised version of the Eprints open-source repository software, replaced its predecessor, the Fedora-based ESRC Data Store, and became the primary publishing system for social sciences research data in the UK, including data resulting from ESRC grants (Fig. 2). ReShare enables researchers to easily self-publish collections of research data and to make them available for use by other researchers. Its features include an easier-to-use depositor interface and a more intuitive workflow (Fig. 3) than its predecessor. Its design was influenced by the Eprints workflow commonly used by many libraries for their output repositories. ReShare further simplifies the deposit of data sets by enabling the upload of multiple files in zip bundles, multiple data types, and associated documentation files. The ease of use is evidenced by the repository manager, who corresponds one-to-one with most depositors, receiving far fewer queries about problems or confusion over the upload system. Data publishing usually proceeds without intervention from the repository manager, apart from the quality checks carried out.
Table 4 Common problems encountered with self-publication of research data, and how to remedy them
The repository metadata profile is based on the DDI schema and aligns with the UK Data Service profile, and the workflow makes it easy to submit the necessary metadata elements in a step-by-step process. Customised controlled vocabularies are aligned with those used in the UK Data Service’s Discover portal. Access control options allow researchers to make data available to users as open or safeguarded data, and a DOI is attached to each deposit, so researchers can cite and track their own data collections. The data collections are discoverable via the Discover portal of the UK Data Service, amongst its portfolio of 7000 data collections.
The repository provides practical, easy-to-follow guidance (Fig. 3) for researchers on preparing and documenting data files before deposit and publication, based on the extensive in-house expertise that results from years of assessing, processing, and documenting social sciences data collections. It also shows the data review procedures (Fig. 3) that UK Data Service staff will carry out once data are submitted and before they are published [30], as a visible indication of our expectations.
By July 2015, ReShare contained about 800 published data collections, spanning qualitative and quantitative research data. Reviewing this vast volume of self-published data has enabled us to identify the common problems researchers may face when publishing their research data (Table 4), adapt guidance, and provide solutions to avoid such common problems in the future. On the whole, the ReShare deposit experience is found to be positive for most depositors, with mostly good quality data and documentation being uploaded and shared.
In general, we handle such problems by relaying submitted collections back to the depositor for editing; by reiterating the quality expectations for data, metadata, and documentation files; by improving help guidance and directing depositors to it; and by improving in-system checks, such as input controls. We have also started showcasing excellent collections on the ReShare home page as exemplars for future depositors and to give credit for best practice.
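The kind of in-system input control mentioned above can be illustrated with a minimal sketch. It is not the ReShare implementation: the function name `check_deposit`, the list of required fields, and the metadata keys are hypothetical and chosen only for illustration; the two access levels are the ones named earlier in the text.

```python
from typing import Dict, List

# Hypothetical required fields; the actual ReShare metadata profile is not reproduced here.
REQUIRED_FIELDS = ("title", "creator", "abstract", "access_level")
# The two access options named in the text.
ACCESS_LEVELS = {"open", "safeguarded"}


def check_deposit(metadata: Dict[str, str], files: List[str]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Every required field must be present and contain non-blank text.
    for name in REQUIRED_FIELDS:
        value = metadata.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing metadata field: {name}")
    # A supplied access level must be one of the known options.
    access = metadata.get("access_level")
    if isinstance(access, str) and access.strip() and access not in ACCESS_LEVELS:
        problems.append(f"unknown access level: {access}")
    # A deposit without any files cannot be reviewed.
    if not files:
        problems.append("no data files uploaded")
    return problems
```

A check of this sort would let the system return a submission to the depositor with specific messages before it reaches a human reviewer; judgements about the quality of documentation would still rest with archive staff.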
The overall result of the guidance and training for researchers and the development of the self-publishing infrastructure, with both guidance and system continually refined in response to issues raised by data depositors, is that many of the data processing and data enhancement procedures formerly done in-house are now carried out by self-publishing researchers, with instructions provided and checks done by archive staff (Tables 1, 2).
We provide succinct guidance on how to prepare data collections for self-publishing and which measures to take to produce well-documented collections suitable for long-term curation, both in the help guidance, and within the system workflow. Practical suggestions are flagged up when starting the deposit process, as well as at the stage of uploading data and documentation files (Table 5).
Table 5 Advice for preparing a data collection for deposit given at the start and during the data deposit process
Therefore, by providing an easy-to-use, step-by-step self-publishing system, complemented by detailed data management guidance online and in best practice guides, together with a regular programme of training workshops for researchers, we can empower researchers to develop their data management skills. We can then focus our own expertise on quality assurance of the published data by reviewing each data set before publication. This involves checking for good levels of metadata and documentation, and ensuring that data collections conform to ethical and legal requirements. In addition, we liaise with researchers prior to data deposit to allay their concerns and answer their questions, which often relate to ethical concerns over data publishing.
In line with recent developments in the data publishing world, ReShare also receives data sets described in data papers published in data journals such as Scientific Data, and facilitates peer review of submitted data sets prior to their publication for scientific quality assurance [31]. This means that at the review stage, between a depositor submitting a data set and the publishing of this data set, peer reviewers selected by the journal are given access to the data set to review the data set itself for research quality. This complements the checks we carry out ourselves for the quality of documentation and metadata, and for disclosive information in data. Only after the reviews have been completed, the depositor has made any required edits to the data set, and the journal has published the data paper, is the data set published. Enabling such an innovative peer review of data required system changes to provide peer reviewers access to unpublished data records, the agreement of procedures with journals, and staff guidance on the handling of the peer review process.
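The release conditions for a peer-reviewed data set can be summarised as a simple gate. The sketch below is illustrative only and assumes nothing about how ReShare stores this state: the class `Submission`, its field names, and the function `can_publish` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """State of one deposited data set awaiting publication (hypothetical field names)."""
    archive_checks_passed: bool = False   # documentation, metadata, and disclosure review
    peer_reviews_completed: bool = False  # reviews by reviewers selected by the journal
    edits_outstanding: bool = False       # changes still required from the depositor
    data_paper_published: bool = False    # the journal has published the data paper


def can_publish(s: Submission) -> bool:
    """Release the data set only when every condition described in the text holds."""
    return (
        s.archive_checks_passed
        and s.peer_reviews_completed
        and not s.edits_outstanding
        and s.data_paper_published
    )
```

For example, a submission that has passed the archive checks and peer review remains unpublished until the journal publishes the data paper, and a new request for edits closes the gate again.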