The availability of workflows for data publishing could have an enormous impact on researchers, research practices and publishing paradigms, as well as on funding strategies and career and research evaluations. We present the generic components of such workflows to provide a reference model for these stakeholders. The RDA-WDS Data Publishing Workflows group set out to study the current data-publishing workflow landscape across disciplines and institutions. A diverse set of workflows was examined to identify common components and standard practices, including basic self-publishing services, institutional data repositories, long-term projects, curated data repositories, and joint data journal and repository arrangements. The results of this examination have been used to derive a data-publishing reference model comprising generic components. Our assessment shows that the current data-publishing landscape is varied and dynamic, and we highlight important gaps and challenges to consider, especially when dealing with more complex workflows and their integration into wider community frameworks. The different components of a data-publishing system need to work, to the greatest extent possible, in a seamless and integrated way to support the evolution of commonly understood and utilized standards and, eventually, to increase reproducibility. We therefore advocate the implementation of existing standards for repositories and all parts of the data-publishing process, and the development of new standards where necessary. Effective and trustworthy data publishing should be embedded in documented workflows. As more research communities seek to publish the data associated with their research, they can build on one or more of the components identified in this reference model.
When we use the term ‘research data’ we mean data that are used as primary sources to support technical or scientific enquiry, research, scholarship, or artistic activity, and that are used as evidence in the research process and/or are commonly accepted in the research community as necessary to validate research findings and results. All digital and non-digital outputs of a research project have the potential to become research data. Research data may be experimental, observational, operational, data from a third party, from the public sector, monitoring data, processed data, or repurposed data (Research Data Canada (2015), Glossary of terms and definitions, http://dictionary.casrai.org/Category:Research_Data_Domain).
A repository (also referred to as a data repository or digital data repository) is a searchable and queryable interfacing entity that is able to store, manage, maintain, and curate Data/Digital Objects. A repository is a managed location (destination, directory or ‘bucket’) where digital data objects are registered, permanently stored, made accessible and retrievable, and curated (Research Data Alliance, Data Foundations and Terminology Working Group. http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page). Repositories preserve, manage, and provide access to many types of digital material in a variety of formats. Materials in online repositories are curated to enable search, discovery, and reuse. There must be sufficient control for the digital material to be authentic, reliable, accessible, and usable on a continuing basis (Research Data Canada (2015), Glossary of terms and definitions, http://dictionary.casrai.org/Category:Research_Data_Domain). Similarly, ‘data services’ assist organizations in the capture, storage, curation, long-term preservation, discovery, access, retrieval, aggregation, analysis, and/or visualization of scientific data, as well as in the associated legal frameworks, to support disciplinary and multidisciplinary scientific research.
For example, the Antarctic Treaty Article III states that “scientific observations and results from Antarctica shall be exchanged and made freely available”. http://www.ats.aq/e/ats_science.html.
Version control (also known as ‘revision control’ or ‘versioning’) is the control, over time, of changes to data, computer code, software, and documents. It allows a previous revision to be restored, which is critical for data traceability, tracking edits, and correcting errors. Source: TeD-T: Term definition tool. Research Data Alliance, Data Foundations and Terminology Working Group. http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page.
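As a rough illustration of this definition, the sketch below keeps every revision of a small record and restores an earlier one after an erroneous edit. The class `VersionedRecord` and its methods are hypothetical names invented for this example; they do not correspond to any RDA tool or to a particular version control system.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedRecord:
    """Hypothetical in-memory store that keeps every revision of a value."""
    revisions: list = field(default_factory=list)

    def commit(self, content: str) -> int:
        """Append a new revision and return its 1-based revision number."""
        self.revisions.append(content)
        return len(self.revisions)

    def current(self) -> str:
        """Return the most recent revision."""
        return self.revisions[-1]

    def revert(self, revision: int) -> int:
        """Restore an earlier revision by committing it again, so the
        history, including the mistaken edit, remains traceable."""
        if not 1 <= revision <= len(self.revisions):
            raise ValueError(f"unknown revision {revision}")
        return self.commit(self.revisions[revision - 1])


record = VersionedRecord()
record.commit("temperature_c,12.1")
record.commit("temperature_c,21.1")  # assumed transcription error
record.revert(1)                     # revision 1 is restored as revision 3
print(record.current())
print(len(record.revisions))
```

Reverting by appending, rather than deleting later revisions, is one possible design choice: it preserves the full edit trail that the definition treats as essential for traceability.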
Research Data Canada (RDC) is an organizational member of Research Data Alliance (RDA) and from the beginning has worked very closely with RDA. See: “Guidelines for the deposit and preservation of research data in Canada”, http://www.rdc-drc.ca/wp-content/uploads/Guidelines-for-Deposit-of-Research-Data-in-Canada-2015.pdf and “Research Data Repository Requirements and Features Review”, http://hdl.handle.net/10864/10892.
Source: OASIS, http://www.oasis-open.org/committees/soa-rm/faq.php.
“Recommendation for Space Data System Practices: Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-M-2.” http://public.ccsds.org/publications/archive/650x0m2.pdf DataCite (2015). “DataCite Metadata Schema for the Publication and Citation of Research Data”. http://dx.doi.org/10.5438/0010.
Force11 (2015). Future Of Research Communications and e-Scholarship http://www.force11.org/group/data-citation-implementation-group.
Indirect linkage or restricted access—see e.g. Open Health Data Journal, http://openhealthdata.metajnl.com.
Quality assurance: The process or set of processes used to measure and assure the quality of a product. Quality control: The process of meeting products and services to consumer expectations (Research Data Canada, 2015, Glossary of terms and definitions, http://dictionary.casrai.org/Category:Research_Data_Domain).
Defined in e.g. .
Program for Climate Model Diagnosis and Intercomparison. (n.d.). Coupled Model Intercomparison Project (CMIP). Retrieved November 11, 2015, from http://www-pcmdi.llnl.gov/projects/cmip/.
Approved by the data journal.
Post-publication peer review is becoming more prevalent and may ultimately strengthen the Parsons–Fox continual release paradigm. See, for instance, F1000 Research and Earth System Science Data and the latter journal’s website: http://www.earth-system-science-data.net/peer_review/interactive_review_process.html.
An example of a discipline standard is the format and metadata standard NetCDF/CF used in Earth system sciences: http://cfconventions.org/.
Intergovernmental Panel on Climate Change Data Distribution Centre (IPCC-DDC): http://ipcc-data.org.
Data Seal of Approval (DSA); Network of Expertise in long-term Storage and Accessibility of Digital Resources in Germany (NESTOR) seal/German Institute for Standardization (DIN) standard 31644; Trustworthy Repositories Audit and Certification (TRAC) criteria / International Organization for Standardization (ISO) standard 16363; and the International Council for Science World Data System (ICSU-WDS) certification.
Data Seal of Approval: http://datasealofapproval.org/en/.
World Data System certification. http://www.icsu-wds.org/files/wds-certification-summary-11-june-2012.pdf.
Among the analyzed workflows, it was generally understood that data citation which properly attributes datasets to originating researchers can be an incentive for deposit of data in a form that makes the data accessible and reusable, a key to changing the culture around scholarly credit for research data.
See e.g. Open Health Data journal http://openhealthdata.metajnl.com/.
Data Citation Synthesis Group, 2014. Accessed 17 November 2015: http://www.force11.org/group/joint-declaration-data-citation-principles-final.
See Sarah Callaghan’s blogpost: Cite what you use, 24 January 2014. Accessed 24 June 2015: http://citingbytes.blogspot.co.uk/2014/01/cite-what-you-use.html.
Funders have an interest in tracking Return on Investment to assess which researchers/projects/fields are effective and whether the proposed new projects consist of new or repeated work.
Accessed 17 November 2015: http://www.ddialliance.org.
Accessed 17 November 2015: http://schema.datacite.org.
RDA/WDS Publishing Data Services WG: http://rd-alliance.org/groups/rdawds-publishing-data-services-wg.html and http://www.icsu-wds.org/community/working-groups/data-publication/services.
See the hiberlink Project for information on this problem and work being done to solve it: http://hiberlink.org/dissemination.html.
RDA/WDS Publishing Data Costs IG addresses this topic: http://rd-alliance.org/groups/rdawds-publishing-data-ig.html.
In genomics, for example, there is the idea of numbered “releases” of a particular animal genome, so that while refinement is ongoing it is still possible to refer to a fixed reference dataset.
For scientific communities with high volume data, the storage of every dataset version is often too expensive. Versioning and keeping a good provenance record of the datasets are crucial for citations of such data collections. Technical solutions are being developed, e.g. by the European Persistent Identifier Consortium (EPIC).
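One way to picture a provenance record that does not require storing every dataset version is sketched below: each version is represented only by a checksum and a pointer to the version it was derived from. This is an illustrative assumption, not the approach developed by EPIC; the names `VersionRecord`, `record_version` and `matches` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical provenance entry: stores a fingerprint, not the data."""
    version: int
    sha256: str
    derived_from: Optional[int]
    note: str


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def record_version(history: List[VersionRecord], data: bytes, note: str) -> VersionRecord:
    """Append a provenance entry for a new dataset version."""
    previous = history[-1].version if history else None
    entry = VersionRecord(
        version=len(history) + 1,
        sha256=fingerprint(data),
        derived_from=previous,
        note=note,
    )
    history.append(entry)
    return entry


def matches(entry: VersionRecord, data: bytes) -> bool:
    """Check whether some bytes are exactly the version that was recorded."""
    return entry.sha256 == fingerprint(data)


history: List[VersionRecord] = []
original = b"station,value\nA,1.0\n"
revised = b"station,value\nA,1.5\n"
v1 = record_version(history, original, "initial release")
v2 = record_version(history, revised, "recalibrated sensor A")
print(v2.derived_from, matches(v1, original), matches(v1, revised))
```

Under these assumptions, a citation could refer to a version number and checksum, and anyone holding a copy of the data could verify that it is the cited version even if the repository no longer stores that version itself.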
At the time of writing, CrossRef had recently announced the concept and approximate launch date for a ‘DOI Event Tracker’, which could also have considerable implications for the perceived value of data publishing as well as for the issues around the associated metrics (Reference: http://crosstech.crossref.org/2015/03/crossrefs-doi-event-tracker-pilot.html by Geoffrey Bilder, accessed 26 October 2015).
Schmidt, B., Gemeinholzer, B., Treloar, A.: Open Data in Global Environmental Research: The Belmont Forum’s Open Data Survey (2015). http://docs.google.com/document/d/1jRM5ZlJ9o4KWIP1GaW3vOzVkXjIIBYONFcd985qTeXE/ed
Vines, T.H., Albert, A.Y.K., Andrew, R.L., DeBarre, F., Bock, D.G., Franklin, M.T., Gilbert, K.J., Moore, J.S., Renaut, S., Rennison, D.J.: The availability of research data declines rapidly with article age. Curr. Biol. 24(1), 94–97 (2014)
Hicks, D., Wouters, P., Waltman, L., De Rijcke, S., Rafols, I.: Bibliometrics: The Leiden Manifesto for research metrics. Nature 520, 429–431 (2015). http://www.nature.com/news/bibliometrics-the-leiden-manifesto-for-research-metrics-1.17351. Accessed 10 November 2015
Piwowar, H., Vision, T.: Data reuse and the open data citation advantage. PeerJ 1, e175 (2013). http://peerj.com/articles/175/. Accessed 10 November 2015
Pienta, A.M., Alter, G.C., Lyle, J.A.: The enduring value of social science research: the use and reuse of primary research data (2010). http://hdl.handle.net/2027.42/78307. Accessed 10 November 2015
Borgman, C.L.: Big data, little data, no data: scholarship in the networked world. MIT Press, Cambridge (2015)
Wallis, J.C., Rolando, E., Borgman, C.L.: If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PLoS One 8(7), e67332 (2013). doi:10.1371/journal.pone.0067332
Peng, R.D.: Reproducible research in computational science. Science 334(6060), 1226–1227 (2011)
Thayer, K.A., Wolfe, M.S., Rooney, A.A., Boyles, A.L., Bucher, J.R., Birnbaum, L.S.: Intersection of systematic review methodology with the NIH reproducibility initiative. Environ. Health Perspect. 122, A176–A177 (2014). http://ehp.niehs.nih.gov/wp-content/uploads/122/7/ehp.1408671.pdf. Accessed 10 November 2015
George, B.J., Sobus, J.R., Phelps, L.P., Rashleigh, B., Simmons, J.E., Hines, R.N.: Raising the bar for reproducible science at the US Environmental Protection Agency Office of Research and Development. Toxicol. Sci. 145(1), 16–22 (2015). http://toxsci.oxfordjournals.org/content/145/1/16.full.pdf+html
Boulton, G., et al.: Science as an open enterprise. R. Soc. Lond. (2012). https://royalsociety.org/policy/projects/science-public-enterprise/Report/. Accessed 10 November 2015
Stodden, V., Bailey, D.H., Borwein, J., LeVeque, R.J., Rider, W., Stein, W.: Setting the default to reproducible. Reproducibility in computational and experimental mathematics. Institute for Computational and Experimental Research in Mathematics (2013). http://icerm.brown.edu/tw12-5-rcem/icerm_report.pdf. Workshop report accessed 10 November 2015
Whyte, A., Tedds, J.: Making the case for research data management. DCC briefing papers. Digital Curation Centre, Edinburgh (2011). http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm. Accessed 10 November 2015
Parsons, M., Fox, P.: Is data publication the right metaphor? Data Sci. J. 12 (2013). doi:10.2481/dsj.WDS-042. Accessed 10 November 2015
Rauber, A., Pröll, S.: Scalable dynamic data citation approaches, reference architectures and applications RDA WG Data Citation position paper. Draft version (2015). http://rd-alliance.org/groups/data-citation-wg/wiki/scalable-dynamic-data-citation-rda-wg-dc-position-paper.html. Accessed 13 November 2015
Rauber, A., Asmi, A., van Uytvanck, D., Pröll, S.: Data citation of evolving data: recommendations of the Working Group on Data Citation (WGDC) Draft—request for comments (2015). Revision of 24th September 2015. http://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150924.pdf. Accessed 6 November 2015
Watson, et al.: The XMM-Newton serendipitous survey. V. The Second XMM-Newton serendipitous source catalogue. Astron. Astrophys. 493(1), 339–373 (2009). doi:10.1051/0004-6361:200810534
Lawrence, B., Jones, C., Matthews, B., Pepler, S., Callaghan, S.: Citation and peer review of data: moving toward formal data publication. Int. J. Digital Curation (2011). doi:10.2218/ijdc.v6i2.205
Callaghan, S., Murphy, F., Tedds, J., Allan, R., Kunze, J., Lawrence, R., Mayernik, M.S., Whyte, A.: Processes and procedures for data publication: a case study in the geosciences. Int. J. Digital Curation 8(1) (2013). doi:10.2218/ijdc.v8i1.253
Austin, C.C., Brown, S., Fong, N., Humphrey, C., Leahey, L., Webster, P.: Research data repositories: review of current features, gap analysis, and recommendations for minimum requirements. Presented at the IASSIST Annual Conference. IASSIST Quarterly Preprint. International Association for Social Science, Information Services, and Technology. Minneapolis (2015). http://drive.google.com/file/d/0B_SRWahCB9rpRF96RkhsUnh1a00/view. Accessed 13 November 2015
Yin, R.: Case study research: design and methods, 5th edn. Sage Publications, Thousand Oaks (2003)
Murphy, F., Bloom, T., Dallmeier-Tiessen, S., Austin, C.C., Whyte, A., Tedds, J., Nurnberger, A., Raymond, L., Stockhause, M., Vardigan, M.: WDS-RDA-F11 Publishing Data Workflows WG Synthesis FINAL CORRECTED. Zenodo (2015). doi:10.5281/zenodo.33899. Accessed 17 November 2015
Stockhause, M., Höck, H., Toussaint, F., Lautenschlager, M.: Quality assessment concept of the World Data Center for Climate and its application to the CMIP5 data. Geosci. Model Dev. 5(4), 1023–1032 (2012). doi:10.5194/gmd-5-1023-2012
Starr, J., Castro, E., Crosas, M., Dumontier, M., Downs, R.R., Duerr, R., Haak, L.L., Haendel, M., Herman, I., Hodson, S., Hourclé, J., Kratz, J.E., Lin, J., Nielsen, L.H., Nurnberger, A., Proell, S., Rauber, A., Sacchi, S., Smith, A., Taylor, M., Clark, T.: Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Comput. Sci. 1(e1) (2015). doi:10.7717/peerj-cs.1
Castro, E., Garnett, A.: Building a bridge between journal articles and research data: The PKP-Dataverse Integration Project. Int. J. Digital Curation 9(1), 176–184 (2014). doi:10.2218/ijdc.v9i1.311
Mayernik, M.S., Callaghan, S., Leigh, R., Tedds, J.A., Worley, S.: Peer review of datasets: when, why, and how. Bull. Am. Meteorol. Soc. 96(2), 191–201 (2015). doi:10.1175/BAMS-D-13-00083.1
Meehl, G.A., Moss, R., Taylor, K.E., Eyring, V., Stouffer, R.J., Bony, S., Stevens, B.: Climate Model Intercomparisons: preparing for the next phase. Eos Trans. AGU 95(9), 77 (2014). doi:10.1002/2014EO090001
Bandrowski, A., Brush, M., Grethe, J.S., Haendel, M.A., Kennedy, D.N., Hill, S., Hof, P.R., Martone, M.E., Pols, M., Tan, S., Washington, N., Zudilova-Seinstra, E., Vasilevsky, N.: The Resource Identification Initiative: a cultural shift in publishing [version 1; referees: 2 approved] F1000Research 4, 134 (2015). doi:10.12688/f1000research.6555.1
Brase, J., Lautenschlager, M., Sens, I.: The Tenth Anniversary of Assigning DOI Names to Scientific Data and a Five Year History of DataCite. D-Lib Mag. 21(1/2) (2015). doi:10.1045/january2015-brase
Cragin, M.H., Palmer, C.L., Carlson, J.R., Witt, M.: Data sharing, small science and institutional repositories. Philos. Trans. R. Soc. A 368(1926), 4023–4038 (2010)
Pryor, G.: Multi-scale data sharing in the life sciences: Some lessons for policy makers. Int. J. Digital Curation 4(3), 71–82 (2009). doi:10.2218/ijdc.v4i3.115
Author statement: All authors affirm that they have no undeclared conflicts of interest. Opinions expressed in this paper are those of the authors and do not necessarily reflect the policies of the organizations with which they are affiliated. Authors contributed to the writing of the article itself and significantly to the analysis. Contributors Timothy Clark, Eleni Castro, Elizabeth Newbold, Samuel Moore and Brian Hole shared their workflows with the group (for the analysis). The authors are listed in alphabetical order.
Theodora Bloom is a member of the Board of Dryad Digital Repository, and works for BMJ, which publishes medical research and has policies on data sharing.
Austin, C.C., Bloom, T., Dallmeier-Tiessen, S. et al. Key components of data publishing: using current best practices to develop a reference model for data publishing. Int J Digit Libr 18, 77–92 (2017). https://doi.org/10.1007/s00799-016-0178-2
- Data publishing
- Open data
- Open Science
- World Data System
- Research Data Alliance