This section reflects on the management and preservation of the four “Vs” of data, viz. (i) volume; (ii) variety; (iii) velocity; and (iv) veracity (Hey, Tansley, & Tolle, 2009). The section links to the specialist skills, described in Chap. 5, that are required to navigate this niche area. Data management and preservation are essential to the long-term sustainability of research equipment. Data management relates to the management of information through its lifecycle, from creation and storage to obsolescence, at which stage the information is deleted. Advanced technologies, along with data-intensive research, are multiplying the volumes of data in all scientific disciplines. In addition, the increase in data generation stems from billions of people using digital and smart devices and social media services, and from sources ranging from research, digitised literature and archives to public services at hospitals and land registries (European Commission, 2016). Big data sets and their management are no longer an issue confined to data-intensive disciplines but have become an everyday challenge in many areas of life. Therefore, the administration and governance of large volumes of both structured and unstructured data, which may involve terabytes or even petabytes of information, need to be understood across various dimensions. This is imperative for ensuring the translation of open science into open innovation that creates value by addressing societal needs.
The research data management lifecycle comprises data (i) creation; (ii) processing; (iii) analysis; (iv) preservation; (v) access; and (vi) re-use (University of Essex, 2017). Efforts must be made to develop the digital infrastructures needed for data generation, dissemination, storage and analysis, with the objective of ensuring that the ideal conditions are met for undertaking excellent research (European Commission, 2016).
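As an illustration only, the six lifecycle stages listed above can be modelled as an ordered enumeration. The names `LifecycleStage`, `next_stage` and `walk_lifecycle` are hypothetical, and treating the lifecycle as strictly linear is a simplifying assumption: in practice, re-use may feed into the creation of new data.

```python
from enum import Enum
from typing import List, Optional


class LifecycleStage(Enum):
    """The six lifecycle stages, numbered in the order they are listed."""
    CREATION = 1
    PROCESSING = 2
    ANALYSIS = 3
    PRESERVATION = 4
    ACCESS = 5
    REUSE = 6


def next_stage(stage: LifecycleStage) -> Optional[LifecycleStage]:
    """Return the stage that follows `stage`, or None once re-use is reached."""
    if stage is LifecycleStage.REUSE:
        return None
    return LifecycleStage(stage.value + 1)


def walk_lifecycle() -> List[str]:
    """List the stage names from creation through to re-use."""
    names = []
    stage: Optional[LifecycleStage] = LifecycleStage.CREATION
    while stage is not None:
        names.append(stage.name)
        stage = next_stage(stage)
    return names
```

A data management plan could use such a structure to record which stage a given data set has reached and which activities are still outstanding.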
The creation of data usually entails describing the research design and the data management plan (format, storage, security and consent for sharing), locating existing data, collecting data, and capturing and creating metadata. Data processing includes transcribing, translating, digitising, validating, anonymising, describing, managing and storing data. Data analysis refers to the interpretation and derivation of data, as well as the preparation of data for preservation and storage. A product of this phase of the research data management lifecycle is the generation of research outputs such as publications. Data preservation requires the migration of data to the best format on a suitable medium where it can be backed up and stored. Integrally linked to data preservation are the creation of metadata and documentation and the archiving of data. Once the above phases of the data management lifecycle have been addressed, measures must be adopted to ensure researcher access to, and re-use of, the data. The former requires the distribution, sharing, promotion and controlled access of the data, as well as the establishment of copyright over it. The latter entails undertaking research reviews, follow-up research and new research, and using the data for teaching and learning (University of Essex, 2017). The decision to either preserve or dispose of data ought to be made up front, during the planning stage. If data is to be preserved, it must be stored with a clear open access policy that adheres to specific traceability requirements as well as national, social, economic and regulatory arrangements (Organisation for Economic Co-operation and Development, 2007). In accessing data, the concept of data citation gains increasing relevance. Data citation is the practice of providing a reference to data in the same way as researchers provide bibliographic references to research publications (Corti, van den Eynden, Bishop, & Woollard, 2014).
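To make the idea of data citation concrete, the following minimal sketch assembles a reference to a data set from a few metadata fields, much as a bibliographic reference is assembled for a publication. The field names, the `format_data_citation` helper and the citation layout are assumptions chosen for illustration; the sources cited above do not prescribe this particular format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Minimal, hypothetical metadata needed to cite a data set."""
    creators: List[str]   # e.g. ["Doe, J.", "Roe, R."]
    year: int             # year the data set was published
    title: str
    publisher: str        # repository or archive holding the data
    identifier: str       # persistent identifier, such as a DOI


def format_data_citation(record: DatasetRecord) -> str:
    """Build a reference string in an assumed author-year layout."""
    authors = "; ".join(record.creators)
    return (
        f"{authors} ({record.year}). {record.title} [Data set]. "
        f"{record.publisher}. {record.identifier}"
    )
```

For example, a record with the placeholder values `["Doe, J.", "Roe, R."]`, `2017`, `"Example survey responses"`, `"Example Data Archive"` and `"doi:10.0000/example"` would yield `Doe, J.; Roe, R. (2017). Example survey responses [Data set]. Example Data Archive. doi:10.0000/example`.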
Access to data yields the following benefits: it (i) increases the returns from public investment in research; (ii) reinforces open scientific inquiry; (iii) encourages diversity of studies and opinions; (iv) promotes new areas of research; and (v) enables the exploration of topics not envisaged or thought possible by the original researchers (Organisation for Economic Co-operation and Development, 2007). Open access to research data from public funding should be easy, timely, user-friendly and preferably internationally available in a transparent manner, ideally via the internet. The European Cloud Initiative advocates sharing data and developing a trusted open environment for storing, sharing and reusing scientific data and results (European Commission, 2016).
Access may be restricted or limited only in instances relating to (i) national security; (ii) privacy and confidentiality of data on human subjects and other personal data; (iii) trade secrets and intellectual property rights, usually derived from engagement(s) with private enterprise; (iv) protection of rare, threatened or endangered species; and (v) data under consideration in legal action(s) (Organisation for Economic Co-operation and Development, 2007). If data is to be disposed of, files should be deleted after they have fulfilled their purpose.
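The five grounds for restriction listed above lend themselves to a simple rule: a data set is open unless at least one ground applies. The sketch below encodes that rule; the `Restriction` enumeration and the `access_decision` function are hypothetical names, and a real access policy would involve considerably more nuance than a yes/no outcome.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class Restriction(Enum):
    """Grounds on which access may be restricted or limited."""
    NATIONAL_SECURITY = "national security"
    PRIVACY_CONFIDENTIALITY = "privacy and confidentiality"
    TRADE_SECRETS_IP = "trade secrets and intellectual property rights"
    PROTECTED_SPECIES = "rare, threatened or endangered species"
    LEGAL_ACTION = "data under consideration in legal action"


def access_decision(restrictions: Iterable[Restriction]) -> Tuple[bool, List[str]]:
    """Return (is_open, grounds): open only when no restriction applies."""
    # Deduplicate and sort so the reported grounds are stable.
    grounds = sorted({r.name for r in restrictions})
    return (len(grounds) == 0, grounds)
```

Returning the list of grounds alongside the decision keeps the reason for any restriction traceable, which matters when access decisions are later reviewed.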
The research data management lifecycle becomes increasingly complex when large data volumes are involved. Large data volumes are synonymous with big data, which is commonly associated with the usage of dedicated large research infrastructure facilities, such as GRIs, that require multinational investments and are utilised by large collaborative networks (Bicarregui et al., 2013). One of the key challenges in managing big data is the undisciplined and unstructured manner in which disparate data is generated, mined and managed by a variety of independent researchers. Such anarchy requires a governmental and inter-governmental policy framework to guide the generation, preservation, storage, access and re-use of large data volumes (Bicarregui et al., 2013). Such a policy framework would also address key issues such as (i) ownership of data; (ii) open data; (iii) disposal of data; (iv) data mining; and (v) data security, amongst others. Ownership is a rather sensitive topic, particularly in instances where the research was funded with public funds. The common practice among public funders is to ensure, through the Conditions of Grant, that scientific data is made universally available for research purposes. This practice of open access aims to improve and maximise access to, and re-use of, research data, including its verification. Linked to general data release is an ethical dilemma, which must be explicitly defined, along with mitigation steps, in a policy framework. The ethical dilemma links to the process of data mining, otherwise termed knowledge discovery in databases, which forms part of the knowledge discovery process. Data mining relates to the extraction of potentially useful, yet unidentified, information from large volumes of data that reside in different databases (Singh & Swaroop, 2013). This is particularly useful in research relating to national defence and security initiatives.
The challenge arises when personal and/or sensitive data is accessed for analysis and publication, as this violates the privacy of the individuals to whom the data refers. Methodological and/or statistical approaches must, therefore, be employed to ensure the privacy and security of personal information in the data mining process (Singh & Swaroop, 2013) (Fig. 7.2).
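One widely known statistical safeguard of this kind is k-anonymity, under which every combination of indirectly identifying attributes (quasi-identifiers) must be shared by at least k records before the data is released. The text above does not name a specific technique, so the sketch below is offered only as an illustrative example; the function names and the sample attributes are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group_size(
    records: List[Dict[str, str]], quasi_identifiers: Sequence[str]
) -> int:
    """Size of the smallest group of records sharing the same
    quasi-identifier values; 0 for an empty table."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


def is_k_anonymous(
    records: List[Dict[str, str]], quasi_identifiers: Sequence[str], k: int
) -> bool:
    """True when every quasi-identifier combination occurs at least k times.
    An empty table is conservatively reported as not k-anonymous."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical sample: two records share (30-39, North); one stands alone.
sample = [
    {"age_band": "30-39", "region": "North", "diagnosis": "A"},
    {"age_band": "30-39", "region": "North", "diagnosis": "B"},
    {"age_band": "40-49", "region": "South", "diagnosis": "A"},
]
```

With this sample, `is_k_anonymous(sample, ["age_band", "region"], 2)` is false because the single record for the 40-49 age band in the South could be re-identified; such a record would need to be generalised or suppressed before release.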